<a href="https://github.com/dd-consulting">
<img src="../reference/GZ_logo.png" width="60" align="right">
</a>
<h1>
One-Stop Analytics: R
</h1>
Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges. The CDC is committed to providing essential data on ASD, searching for risk factors and possible causes, and developing resources that help identify children with ASD as early as possible.
Doctors cited better awareness among parents and preschool teachers, leading to early referrals for diagnosis.
https://www.gov.sg/news/content/today-online-more-preschoolers-diagnosed-with-developmental-issues
https://www.cdc.gov/ncbddd/autism/data/index.html
<h3>
R Fundamentals - Get & Set working directory
</h3>
Obtain current R working directory
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
Set new R working directory
# setwd("/media/sf_vm_shared_folder/git/DDC/DDC-ASD/model_R")
# setwd('~/Desktop/admin-desktop/vm_shared_folder/git/DDC-ASD/model_R')
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
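Hard-coded absolute paths like the commented-out ones above usually exist on one machine only. The following is a minimal sketch, assuming a hypothetical target folder (`tempdir()` stands in for the project folder so the example runs anywhere), that checks the directory exists before switching and restores the previous directory afterwards:

```r
# Remember the starting directory so the change can be undone.
old_dir <- getwd()

# Hypothetical target folder; tempdir() is used only for illustration.
target_dir <- tempdir()

# setwd() raises an error for a missing folder, so check first.
if (dir.exists(target_dir)) {
  setwd(target_dir)
}
new_dir <- getwd()

# Restore the original working directory.
setwd(old_dir)
```

Keeping `old_dir` around makes it easy to return to the original location once the work in the other folder is done.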
Read in CSV data, storing as R dataframe
# Dataset: U.S. national-level ASD prevalence among children
ASD_National <- read.csv("../dataset/ADV_ASD_National.csv", stringsAsFactors = FALSE)
# Dataset: U.S. state-level ASD prevalence among children
ASD_State <- read.csv("../dataset/ADV_ASD_State.csv", stringsAsFactors = FALSE)
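If a file is known to use placeholder text for missing values, `read.csv()` can convert it to `NA` at import time through its `na.strings` argument. The sketch below writes a small made-up CSV to a temporary file (the file contents and the name `toy` are assumptions for illustration, not the real dataset) and reads it back:

```r
# Write a tiny illustrative CSV; the values are invented for this example.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("Source,Year,Male.Prevalence",
             "addm,2000,No data",
             "addm,2002,11.5",
             "nsch,2004,"), tmp_csv)

# Treat empty cells, "No data" and "NA" as missing while reading.
toy <- read.csv(tmp_csv,
                stringsAsFactors = FALSE,
                na.strings = c("", "No data", "NA"))

str(toy)          # Male.Prevalence is read as numeric, not character
sum(is.na(toy))   # count of missing cells
```

Because the placeholders never reach the dataframe as text, a column that holds numbers is read as numeric directly.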
Look at first/last few rows of data
head(ASD_National)
## Source Year Prevalence Upper.CI Lower.CI Prevalence_dup
## 1 addm 2000 6.7 7.0 6.3 6.7
## 2 addm 2002 6.6 6.8 6.3 6.6
## 3 addm 2004 8.0 8.4 7.6 8.0
## 4 addm 2006 9.0 9.3 8.6 9.0
## 5 addm 2008 11.3 11.7 11.0 11.3
## 6 addm 2010 14.7 15.1 14.3 14.7
## Source_Full1
## 1 Autism & Developmental Disabilities Monitoring Network
## 2 Autism & Developmental Disabilities Monitoring Network
## 3 Autism & Developmental Disabilities Monitoring Network
## 4 Autism & Developmental Disabilities Monitoring Network
## 5 Autism & Developmental Disabilities Monitoring Network
## 6 Autism & Developmental Disabilities Monitoring Network
## Source_Full2 Male.Prevalence
## 1 addm-Autism & Developmental Disabilities Monitoring Network No data
## 2 addm-Autism & Developmental Disabilities Monitoring Network 11.5
## 3 addm-Autism & Developmental Disabilities Monitoring Network 12.9
## 4 addm-Autism & Developmental Disabilities Monitoring Network 14.5
## 5 addm-Autism & Developmental Disabilities Monitoring Network 18.4
## 6 addm-Autism & Developmental Disabilities Monitoring Network 23.7
## Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1 No data No data No data No data No data
## 2 No data No data 2.7 No data No data
## 3 12.2 13.7 2.9 2.6 3.3
## 4 13.9 15.1 3.2 2.9 3.5
## 5 17.7 19 4 3.7 4.3
## 6 23 24.4 5.3 5 5.7
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1 No data No data
## 2 7.7 No data
## 3 9.7 9.1
## 4 9.9 9.4
## 5 12 11.5
## 6 15.8 15.2
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1 No data No data
## 2 No data 6.5
## 3 10.4 6.9
## 4 10.4 7.2
## 5 12.5 10.2
## 6 16.3 12.3
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1 No data No data No data
## 2 No data No data No data
## 3 6.2 7.6 6.2
## 4 6.6 7.8 5.9
## 5 9.5 10.9 7.9
## 6 11.5 13.1 10.8
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1 No data No data No data
## 2 No data No data No data
## 3 5 7.5 No data
## 4 5.3 6.6 No data
## 5 7.2 8.6 9.7
## 6 10 11.6 12.3
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1 No data No data
## 2 No data No data
## 3 No data No data
## 4 No data No data
## 5 8.1 11.6
## 6 10.7 14.2
tail(ASD_State)
## State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1687 UT 596257 8.7 8.5 9.0 2016 sped
## 1688 VT 74108 12.1 11.3 12.9 2016 sped
## 1689 VA 1162945 14.2 14.0 14.4 2016 sped
## 1690 WA 1006676 11.2 11.0 11.4 2016 sped
## 1691 WV 239037 8.6 8.3 9.0 2016 sped
## 1692 WY 85922 9.3 8.7 10.0 2016 sped
## Source_Full1 State_Full1 State_Full2 Numerator_ASD
## 1687 Special Education Child Count Utah UT-Utah 5187
## 1688 Special Education Child Count Vermont VT-Vermont 897
## 1689 Special Education Child Count Virginia VA-Virginia 16514
## 1690 Special Education Child Count Washington WA-Washington 11275
## 1691 Special Education Child Count West Virginia WV-West Virginia 2056
## 1692 Special Education Child Count Wyoming WY-Wyoming 799
## Numerator_NonASD Proportion X95_Z_CI Z_Lower.CI Z_Upper.CI
## 1687 591070 0.008699269 0.000235709 8.463560 8.934978
## 1688 73211 0.012103956 0.000787290 11.316666 12.891247
## 1689 1146431 0.014200156 0.000215035 13.985121 14.415191
## 1690 995401 0.011200227 0.000205575 10.994652 11.405803
## 1691 236981 0.008601179 0.000370185 8.230994 8.971364
## 1692 85123 0.009299132 0.000641783 8.657349 9.940915
## Z_Lower.CI_ABSerror Z_Upper.CI_ABSerror Chi_Wilson_P X95_Chi_Wilson_CI
## 1687 0.036439666 0.065022456 0.008702434 0.000235729
## 1688 0.016666193 0.008753417 0.012129246 0.000787676
## 1689 0.014879497 0.015190775 0.014201760 0.000215041
## 1690 0.005347969 0.005802534 0.011202093 0.000205583
## 1691 0.069006017 0.028636189 0.008609076 0.000370266
## 1692 0.042651475 0.059084984 0.009321069 0.000642144
## Chi_Wilson_Lower.CI Chi_Wilson_Upper.CI Chi_Wilson_Lower.CI_ABSerror
## 1687 8.466705 8.938163 0.033294913
## 1688 11.341570 12.916922 0.041569768
## 1689 13.986720 14.416801 0.013280432
## 1690 10.996509 11.407676 0.003490794
## 1691 8.238810 8.979342 0.061190335
## 1692 8.678926 9.963213 0.021074361
## Chi_Wilson_Upper.CI_ABSerror Chi_Wilson_Corrected_w_minus.CI
## 1687 0.061836719 0.008465878
## 1688 0.016921499 0.011335040
## 1689 0.016801104 0.013986293
## 1690 0.007675848 0.010996017
## 1691 0.020658015 0.008236763
## 1692 0.036786888 0.008673305
## Chi_Wilson_Corrected_w_plus.CI Chi_Wilson_Corrected_Lower.CI
## 1687 0.008939013 8.465878
## 1688 0.012923885 11.335040
## 1689 0.014417234 13.986293
## 1690 0.011408177 10.996017
## 1691 0.008981478 8.236763
## 1692 0.009969231 8.673305
## Chi_Wilson_Corrected_Upper.CI Chi_Wilson_Corrected_Lower.CI_ABSerror
## 1687 8.939013 0.03412221
## 1688 12.923885 0.03503985
## 1689 14.417234 0.01370717
## 1690 11.408177 0.00398297
## 1691 8.981478 0.06323741
## 1692 9.969231 0.02669451
## Chi_Wilson_Corrected_Upper.CI_ABSerror Male.Prevalence Male.Lower.CI
## 1687 0.060986900 NA NA
## 1688 0.023884634 NA NA
## 1689 0.017234254 NA NA
## 1690 0.008177037 NA NA
## 1691 0.018521714 NA NA
## 1692 0.030769154 NA NA
## Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1687 NA NA NA NA
## 1688 NA NA NA NA
## 1689 NA NA NA NA
## 1690 NA NA NA NA
## 1691 NA NA NA NA
## 1692 NA NA NA NA
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1687 NA NA
## 1688 NA NA
## 1689 NA NA
## 1690 NA NA
## 1691 NA NA
## 1692 NA NA
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1687 NA
## 1688 NA
## 1689 NA
## 1690 NA
## 1691 NA
## 1692 NA
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI
## 1687
## 1688
## 1689
## 1690
## 1691
## 1692
## Hispanic.Prevalence Hispanic.Lower.CI Hispanic.Upper.CI
## 1687
## 1688
## 1689
## 1690
## 1691
## 1692
## Asian.or.Pacific.Islander.Prevalence Asian.or.Pacific.Islander.Lower.CI
## 1687
## 1688
## 1689
## 1690
## 1691
## 1692
## Asian.or.Pacific.Islander.Upper.CI State_Region
## 1687 D8 Mountain
## 1688 D1 New England
## 1689 D5 South Atlantic
## 1690 D9 Pacific
## 1691 D5 South Atlantic
## 1692 D8 Mountain
Obtain number of rows and number of columns/features/variables
dim(ASD_National)
## [1] 42 26
dim(ASD_State)
## [1] 1692 49
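`dim()` returns rows and columns together. When only one of the two is needed, `nrow()` and `ncol()` are a common alternative. A small sketch on a toy dataframe (`toy` is a stand-in built from three rows of the state output above):

```r
# Toy dataframe with three rows and two columns.
toy <- data.frame(State = c("UT", "VT", "VA"),
                  Prevalence = c(8.7, 12.1, 14.2),
                  stringsAsFactors = FALSE)

dims   <- dim(toy)    # c(rows, columns)
n_rows <- nrow(toy)   # rows only
n_cols <- ncol(toy)   # columns only
```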
Obtain overview (data structure/types)
str(ASD_National)
## 'data.frame': 42 obs. of 26 variables:
## $ Source : chr "addm" "addm" "addm" "addm" ...
## $ Year : int 2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 ...
## $ Prevalence : num 6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
## $ Upper.CI : num 7 6.8 8.4 9.3 11.7 15.1 15.2 17.3 12 18.1 ...
## $ Lower.CI : num 6.3 6.3 7.6 8.6 11 14.3 14.4 16.4 7.4 14.5 ...
## $ Prevalence_dup : num 6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
## $ Source_Full1 : chr "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
## $ Source_Full2 : chr "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" ...
## $ Male.Prevalence : chr "No data" "11.5" "12.9" "14.5" ...
## $ Male.Lower.CI : chr "No data" "No data" "12.2" "13.9" ...
## $ Male.Upper.CI : chr "No data" "No data" "13.7" "15.1" ...
## $ Female.Prevalence : chr "No data" "2.7" "2.9" "3.2" ...
## $ Female.Lower.CI : chr "No data" "No data" "2.6" "2.9" ...
## $ Female.Upper.CI : chr "No data" "No data" "3.3" "3.5" ...
## $ Non.hispanic.white.Prevalence : chr "No data" "7.7" "9.7" "9.9" ...
## $ Non.hispanic.white.Lower.CI : chr "No data" "No data" "9.1" "9.4" ...
## $ Non.hispanic.white.Upper.CI : chr "No data" "No data" "10.4" "10.4" ...
## $ Non.hispanic.black.Prevalence : chr "No data" "6.5" "6.9" "7.2" ...
## $ Non.hispanic.black.Lower.CI : chr "No data" "No data" "6.2" "6.6" ...
## $ Non.hispanic.black.Upper.CI : chr "No data" "No data" "7.6" "7.8" ...
## $ Hispanic.Prevalence : chr "No data" "No data" "6.2" "5.9" ...
## $ Hispanic.Lower.CI : chr "No data" "No data" "5" "5.3" ...
## $ Hispanic.Upper.CI : chr "No data" "No data" "7.5" "6.6" ...
## $ Asian.or.Pacific.Islander.Prevalence: chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Lower.CI : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Upper.CI : chr "No data" "No data" "No data" "No data" ...
str(ASD_State)
## 'data.frame': 1692 obs. of 49 variables:
## $ State : chr "AZ" "GA" "MD" "NJ" ...
## $ Denominator : int 45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
## $ Prevalence : num 6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
## $ Lower.CI : num 5.8 5.8 4.6 8.9 5.4 3.7 2.7 5.5 6.1 4.6 ...
## $ Upper.CI : num 7.3 7.3 6.6 11.1 7.4 5.5 3.9 7 7.8 7.5 ...
## $ Year : int 2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
## $ Source : chr "addm" "addm" "addm" "addm" ...
## $ Source_Full1 : chr "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
## $ State_Full1 : chr "Arizona" "Georgia" "Maryland" "New Jersey" ...
## $ State_Full2 : chr "AZ-Arizona" "GA-Georgia" "MD-Maryland" "NJ-New Jersey" ...
## $ Numerator_ASD : int 295 283 118 294 155 104 117 280 252 65 ...
## $ Numerator_NonASD : int 45027 43310 21414 29420 24380 22961 35355 44833 36220 10955 ...
## $ Proportion : num 0.00651 0.00649 0.00548 0.00989 0.00632 ...
## $ X95_Z_CI : num 0.00074 0.000754 0.000986 0.001125 0.000991 ...
## $ Z_Lower.CI : num 5.77 5.74 4.49 8.77 5.33 ...
## $ Z_Upper.CI : num 7.25 7.25 6.47 11.02 7.31 ...
## $ Z_Lower.CI_ABSerror : num 0.0314 0.062 0.1059 0.1311 0.0739 ...
## $ Z_Upper.CI_ABSerror : num 0.0507 0.0542 0.1337 0.0803 0.0911 ...
## $ Chi_Wilson_P : num 0.00655 0.00654 0.00557 0.00996 0.00639 ...
## $ X95_Chi_Wilson_CI : num 0.000741 0.000755 0.00099 0.001127 0.000994 ...
## $ Chi_Wilson_Lower.CI : num 5.81 5.78 4.58 8.83 5.4 ...
## $ Chi_Wilson_Upper.CI : num 7.29 7.29 6.56 11.08 7.39 ...
## $ Chi_Wilson_Lower.CI_ABSerror : num 0.009314 0.019761 0.021503 0.069416 0.000453 ...
## $ Chi_Wilson_Upper.CI_ABSerror : num 0.0077 0.00953 0.04165 0.01523 0.01087 ...
## $ Chi_Wilson_Corrected_w_minus.CI : num 0.0058 0.00577 0.00456 0.00881 0.00538 ...
## $ Chi_Wilson_Corrected_w_plus.CI : num 0.0073 0.0073 0.00658 0.0111 0.00741 ...
## $ Chi_Wilson_Corrected_Lower.CI : num 5.8 5.77 4.56 8.81 5.38 ...
## $ Chi_Wilson_Corrected_Upper.CI : num 7.3 7.3 6.58 11.1 7.41 ...
## $ Chi_Wilson_Corrected_Lower.CI_ABSerror: num 0.00109 0.03057 0.04265 0.08529 0.01834 ...
## $ Chi_Wilson_Corrected_Upper.CI_ABSerror: num 0.00395 0.0026 0.01636 0.00254 0.01108 ...
## $ Male.Prevalence : num 9.7 11 8.6 14.8 9.3 6.6 5 10.1 10.7 9.9 ...
## $ Male.Lower.CI : num 8.5 9.7 7.1 13 7.8 5.2 4.1 8.8 9.3 7.6 ...
## $ Male.Upper.CI : num 11.1 12.4 10.6 16.8 11.2 8.2 6.2 11.4 12.3 12.9 ...
## $ Female.Prevalence : num 3.2 2 2.2 4.3 3.3 2.4 1.4 2.2 2.9 1.7 ...
## $ Female.Lower.CI : num 2.5 1.5 1.5 3.3 2.4 1.6 0.9 1.7 2.2 0.9 ...
## $ Female.Upper.CI : num 4 2.7 2.7 5.5 4.5 3.5 2.1 2.9 3.8 3.2 ...
## $ Non.hispanic.white.Prevalence : num 8.6 7.9 4.9 11.3 6.5 4.5 3.3 7.7 7.4 6.4 ...
## $ Non.hispanic.white.Lower.CI : num 7.5 6.7 3.8 9.5 5.2 3.7 2.6 6.7 6.5 4.8 ...
## $ Non.hispanic.white.Upper.CI : num 9.8 9.3 6.4 13.3 8.2 5.5 4.1 8.9 8.6 8.5 ...
## $ Non.hispanic.black.Prevalence : chr "7.3" "5.3" "6.1" "10.6" ...
## $ Non.hispanic.black.Lower.CI : chr "4.4" "4.4" "4.7" "8.5" ...
## $ Non.hispanic.black.Upper.CI : chr "12.2" "6.4" "8" "13.1" ...
## $ Hispanic.Prevalence : chr "No data" "No data" "No data" "No data" ...
## $ Hispanic.Lower.CI : chr "No data" "No data" "No data" "No data" ...
## $ Hispanic.Upper.CI : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Prevalence : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Lower.CI : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Upper.CI : chr "No data" "No data" "No data" "No data" ...
## $ State_Region : chr "D8 Mountain" "D5 South Atlantic" "D5 South Atlantic" "D2 Middle Atlantic" ...
Obtain column names
names(ASD_National)
## [1] "Source"
## [2] "Year"
## [3] "Prevalence"
## [4] "Upper.CI"
## [5] "Lower.CI"
## [6] "Prevalence_dup"
## [7] "Source_Full1"
## [8] "Source_Full2"
## [9] "Male.Prevalence"
## [10] "Male.Lower.CI"
## [11] "Male.Upper.CI"
## [12] "Female.Prevalence"
## [13] "Female.Lower.CI"
## [14] "Female.Upper.CI"
## [15] "Non.hispanic.white.Prevalence"
## [16] "Non.hispanic.white.Lower.CI"
## [17] "Non.hispanic.white.Upper.CI"
## [18] "Non.hispanic.black.Prevalence"
## [19] "Non.hispanic.black.Lower.CI"
## [20] "Non.hispanic.black.Upper.CI"
## [21] "Hispanic.Prevalence"
## [22] "Hispanic.Lower.CI"
## [23] "Hispanic.Upper.CI"
## [24] "Asian.or.Pacific.Islander.Prevalence"
## [25] "Asian.or.Pacific.Islander.Lower.CI"
## [26] "Asian.or.Pacific.Islander.Upper.CI"
names(ASD_State)
## [1] "State"
## [2] "Denominator"
## [3] "Prevalence"
## [4] "Lower.CI"
## [5] "Upper.CI"
## [6] "Year"
## [7] "Source"
## [8] "Source_Full1"
## [9] "State_Full1"
## [10] "State_Full2"
## [11] "Numerator_ASD"
## [12] "Numerator_NonASD"
## [13] "Proportion"
## [14] "X95_Z_CI"
## [15] "Z_Lower.CI"
## [16] "Z_Upper.CI"
## [17] "Z_Lower.CI_ABSerror"
## [18] "Z_Upper.CI_ABSerror"
## [19] "Chi_Wilson_P"
## [20] "X95_Chi_Wilson_CI"
## [21] "Chi_Wilson_Lower.CI"
## [22] "Chi_Wilson_Upper.CI"
## [23] "Chi_Wilson_Lower.CI_ABSerror"
## [24] "Chi_Wilson_Upper.CI_ABSerror"
## [25] "Chi_Wilson_Corrected_w_minus.CI"
## [26] "Chi_Wilson_Corrected_w_plus.CI"
## [27] "Chi_Wilson_Corrected_Lower.CI"
## [28] "Chi_Wilson_Corrected_Upper.CI"
## [29] "Chi_Wilson_Corrected_Lower.CI_ABSerror"
## [30] "Chi_Wilson_Corrected_Upper.CI_ABSerror"
## [31] "Male.Prevalence"
## [32] "Male.Lower.CI"
## [33] "Male.Upper.CI"
## [34] "Female.Prevalence"
## [35] "Female.Lower.CI"
## [36] "Female.Upper.CI"
## [37] "Non.hispanic.white.Prevalence"
## [38] "Non.hispanic.white.Lower.CI"
## [39] "Non.hispanic.white.Upper.CI"
## [40] "Non.hispanic.black.Prevalence"
## [41] "Non.hispanic.black.Lower.CI"
## [42] "Non.hispanic.black.Upper.CI"
## [43] "Hispanic.Prevalence"
## [44] "Hispanic.Lower.CI"
## [45] "Hispanic.Upper.CI"
## [46] "Asian.or.Pacific.Islander.Prevalence"
## [47] "Asian.or.Pacific.Islander.Lower.CI"
## [48] "Asian.or.Pacific.Islander.Upper.CI"
## [49] "State_Region"
Display column name with its index number
cbind(names(ASD_National), c(1:length(names(ASD_National))))
## [,1] [,2]
## [1,] "Source" "1"
## [2,] "Year" "2"
## [3,] "Prevalence" "3"
## [4,] "Upper.CI" "4"
## [5,] "Lower.CI" "5"
## [6,] "Prevalence_dup" "6"
## [7,] "Source_Full1" "7"
## [8,] "Source_Full2" "8"
## [9,] "Male.Prevalence" "9"
## [10,] "Male.Lower.CI" "10"
## [11,] "Male.Upper.CI" "11"
## [12,] "Female.Prevalence" "12"
## [13,] "Female.Lower.CI" "13"
## [14,] "Female.Upper.CI" "14"
## [15,] "Non.hispanic.white.Prevalence" "15"
## [16,] "Non.hispanic.white.Lower.CI" "16"
## [17,] "Non.hispanic.white.Upper.CI" "17"
## [18,] "Non.hispanic.black.Prevalence" "18"
## [19,] "Non.hispanic.black.Lower.CI" "19"
## [20,] "Non.hispanic.black.Upper.CI" "20"
## [21,] "Hispanic.Prevalence" "21"
## [22,] "Hispanic.Lower.CI" "22"
## [23,] "Hispanic.Upper.CI" "23"
## [24,] "Asian.or.Pacific.Islander.Prevalence" "24"
## [25,] "Asian.or.Pacific.Islander.Lower.CI" "25"
## [26,] "Asian.or.Pacific.Islander.Upper.CI" "26"
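`cbind()` on a character vector and a number vector produces a character matrix, which is why the index values above are printed in quotes. One possible variation, sketched here on a toy dataframe (`toy` and `col_index` are illustrative names), keeps the index numeric by building a dataframe instead:

```r
# Toy dataframe with three columns.
toy <- data.frame(Source = "addm", Year = 2000L, Prevalence = 6.7,
                  stringsAsFactors = FALSE)

# One row per column: numeric index plus the column name.
col_index <- data.frame(index = seq_along(names(toy)),
                        name = names(toy),
                        stringsAsFactors = FALSE)
col_index

# Look up the position of a single column by name.
prevalence_pos <- which(names(toy) == "Prevalence")
```

`seq_along()` is used instead of `1:length(...)` because it also behaves sensibly when the input is empty.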
Look at data structure/schema (Selected columns)
str(ASD_National[, c(1:8, 24, 25, 26)])
## 'data.frame': 42 obs. of 11 variables:
## $ Source : chr "addm" "addm" "addm" "addm" ...
## $ Year : int 2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 ...
## $ Prevalence : num 6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
## $ Upper.CI : num 7 6.8 8.4 9.3 11.7 15.1 15.2 17.3 12 18.1 ...
## $ Lower.CI : num 6.3 6.3 7.6 8.6 11 14.3 14.4 16.4 7.4 14.5 ...
## $ Prevalence_dup : num 6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
## $ Source_Full1 : chr "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
## $ Source_Full2 : chr "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" ...
## $ Asian.or.Pacific.Islander.Prevalence: chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Lower.CI : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Upper.CI : chr "No data" "No data" "No data" "No data" ...
<h3>
Quiz:
</h3>
<p>
Obtain feature/column names and column index of dataframe: ASD_State
</p>
# Write your code below and press Shift+Enter to execute
<h3>
R Fundamentals - Work with dataframe
</h3>
Access column 1 as a one-column dataframe (a named list):
# use column index:
ASD_National[1]
## Source
## 1 addm
## 2 addm
## 3 addm
## 4 addm
## 5 addm
## 6 addm
## 7 addm
## 8 addm
## 9 nsch
## 10 nsch
## 11 nsch
## 12 nsch
## 13 sped
## 14 sped
## 15 sped
## 16 sped
## 17 sped
## 18 sped
## 19 sped
## 20 sped
## 21 sped
## 22 sped
## 23 sped
## 24 sped
## 25 sped
## 26 sped
## 27 sped
## 28 sped
## 29 sped
## 30 medi
## 31 medi
## 32 medi
## 33 medi
## 34 medi
## 35 medi
## 36 medi
## 37 medi
## 38 medi
## 39 medi
## 40 medi
## 41 medi
## 42 medi
typeof(ASD_National[1])
## [1] "list"
ASD_National[1]$Source
## [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
typeof(ASD_National[1]$Source)
## [1] "character"
# use column name:
ASD_National["Source"]
## Source
## 1 addm
## 2 addm
## 3 addm
## 4 addm
## 5 addm
## 6 addm
## 7 addm
## 8 addm
## 9 nsch
## 10 nsch
## 11 nsch
## 12 nsch
## 13 sped
## 14 sped
## 15 sped
## 16 sped
## 17 sped
## 18 sped
## 19 sped
## 20 sped
## 21 sped
## 22 sped
## 23 sped
## 24 sped
## 25 sped
## 26 sped
## 27 sped
## 28 sped
## 29 sped
## 30 medi
## 31 medi
## 32 medi
## 33 medi
## 34 medi
## 35 medi
## 36 medi
## 37 medi
## 38 medi
## 39 medi
## 40 medi
## 41 medi
## 42 medi
ASD_National['Source']$Source
## [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
Access column 1 as a character (chr) vector:
ASD_National[, 1]
## [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
# or
ASD_National[, "Source"]
## [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
# or
ASD_National$Source
## [1] "addm" "addm" "addm" "addm" "addm" "addm" "addm" "addm" "nsch" "nsch"
## [11] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
## [21] "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "medi"
## [31] "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi" "medi"
## [41] "medi" "medi"
typeof(ASD_National$Source)
## [1] "character"
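The difference between the access styles above comes down to what each operator returns. A short sketch on a toy dataframe (the name `toy` and its values are for illustration) puts them side by side:

```r
toy <- data.frame(Source = c("addm", "nsch", "sped"),
                  Year = c(2000L, 2004L, 2007L),
                  stringsAsFactors = FALSE)

as_df   <- toy[1]                   # single bracket: one-column dataframe
as_vec1 <- toy[[1]]                 # double bracket: the column itself
as_vec2 <- toy[, 1]                 # matrix-style: dimension dropped to a vector
as_df2  <- toy[, 1, drop = FALSE]   # matrix-style, but keep the dataframe

class(as_df)     # "data.frame"
class(as_vec1)   # "character"
```

For a plain `data.frame`, `toy[, 1]` drops to a vector by default; passing `drop = FALSE` keeps the result a dataframe.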
Count the number of elements in an object:
length(ASD_National) # number of features/columns
## [1] 26
length(ASD_National[1, ]) # number of elements(columns) in row 1
## [1] 26
length(ASD_National[, 1]) # number of elements(rows) in column 1
## [1] 42
length(ASD_National[, "Source"]) # same as above
## [1] 42
length(ASD_National$Source) # number of elements in the character vector
## [1] 42
Access elements from a dataframe (select the column as a one-column dataframe, then the rows):
# using column index
ASD_National[1][1, ]
## [1] "addm"
ASD_National[1][11, ]
## [1] "nsch"
ASD_National[1][11:20, ]
## [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
# using column name
ASD_National["Source"][1, ]
## [1] "addm"
ASD_National["Source"][11, ]
## [1] "nsch"
ASD_National["Source"][11:20, ]
## [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
Access elements from a dataframe (extract the column as a vector, then index the vector):
# using column index
ASD_National[, 1][1]
## [1] "addm"
ASD_National[, 1][11]
## [1] "nsch"
ASD_National[, 1][11:20]
## [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
# using column name
ASD_National[, "Source"][1]
## [1] "addm"
ASD_National[, "Source"][11]
## [1] "nsch"
ASD_National[, "Source"][11:20]
## [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
# using $ operator
ASD_National$Source[1]
## [1] "addm"
ASD_National$Source[11]
## [1] "nsch"
ASD_National$Source[11:20]
## [1] "nsch" "nsch" "sped" "sped" "sped" "sped" "sped" "sped" "sped" "sped"
Access elements from different columns:
cbind(names(ASD_National), c(1:length(names(ASD_National))))
## [,1] [,2]
## [1,] "Source" "1"
## [2,] "Year" "2"
## [3,] "Prevalence" "3"
## [4,] "Upper.CI" "4"
## [5,] "Lower.CI" "5"
## [6,] "Prevalence_dup" "6"
## [7,] "Source_Full1" "7"
## [8,] "Source_Full2" "8"
## [9,] "Male.Prevalence" "9"
## [10,] "Male.Lower.CI" "10"
## [11,] "Male.Upper.CI" "11"
## [12,] "Female.Prevalence" "12"
## [13,] "Female.Lower.CI" "13"
## [14,] "Female.Upper.CI" "14"
## [15,] "Non.hispanic.white.Prevalence" "15"
## [16,] "Non.hispanic.white.Lower.CI" "16"
## [17,] "Non.hispanic.white.Upper.CI" "17"
## [18,] "Non.hispanic.black.Prevalence" "18"
## [19,] "Non.hispanic.black.Lower.CI" "19"
## [20,] "Non.hispanic.black.Upper.CI" "20"
## [21,] "Hispanic.Prevalence" "21"
## [22,] "Hispanic.Lower.CI" "22"
## [23,] "Hispanic.Upper.CI" "23"
## [24,] "Asian.or.Pacific.Islander.Prevalence" "24"
## [25,] "Asian.or.Pacific.Islander.Lower.CI" "25"
## [26,] "Asian.or.Pacific.Islander.Upper.CI" "26"
ASD_National[1, 1] # row 1, column 1: "Source"
## [1] "addm"
ASD_National[10, 1] # row 10, column 1: "Source"
## [1] "nsch"
ASD_National[1, 3] # row 1, column 3: "Prevalence"
## [1] 6.7
ASD_National[10, 3] # row 10, column 3: "Prevalence"
## [1] 16.2
ASD_National[1:10, 1:3] # row 1 to 10 from column 1 to 3
## Source Year Prevalence
## 1 addm 2000 6.7
## 2 addm 2002 6.6
## 3 addm 2004 8.0
## 4 addm 2006 9.0
## 5 addm 2008 11.3
## 6 addm 2010 14.7
## 7 addm 2012 14.8
## 8 addm 2014 16.8
## 9 nsch 2004 9.5
## 10 nsch 2008 16.2
# or using columns names
ASD_National[1:10, c('Source', 'Year', 'Prevalence')]
## Source Year Prevalence
## 1 addm 2000 6.7
## 2 addm 2002 6.6
## 3 addm 2004 8.0
## 4 addm 2006 9.0
## 5 addm 2008 11.3
## 6 addm 2010 14.7
## 7 addm 2012 14.8
## 8 addm 2014 16.8
## 9 nsch 2004 9.5
## 10 nsch 2008 16.2
ASD_National[c(1:10, 20, 30:35), c(1:3, 9, 12)] # rows 1 to 10, 20, and 30 to 35; columns 1 to 3, 9, and 12
## Source Year Prevalence Male.Prevalence Female.Prevalence
## 1 addm 2000 6.7 No data No data
## 2 addm 2002 6.6 11.5 2.7
## 3 addm 2004 8.0 12.9 2.9
## 4 addm 2006 9.0 14.5 3.2
## 5 addm 2008 11.3 18.4 4
## 6 addm 2010 14.7 23.7 5.3
## 7 addm 2012 14.8 23.4 5.2
## 8 addm 2014 16.8 26.6 6.6
## 9 nsch 2004 9.5
## 10 nsch 2008 16.2
## 20 sped 2007 5.4
## 30 medi 2000 2.3
## 31 medi 2001 2.6
## 32 medi 2002 2.8
## 33 medi 2003 3.0
## 34 medi 2004 3.5
## 35 medi 2005 3.9
[ Tips ] Notice the missing data above: some cells contain “No data” and others are empty.
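To see how widespread such placeholders are, one option is to count them per column. The sketch below uses a toy dataframe with invented values (`toy`, `placeholders` and `placeholder_counts` are illustrative names):

```r
toy <- data.frame(Year = c(2000L, 2004L, 2007L),
                  Male.Prevalence = c("No data", "", "11.5"),
                  Female.Prevalence = c("No data", "2.7", "2.9"),
                  stringsAsFactors = FALSE)

# Strings that stand for "missing" in this toy example.
placeholders <- c("", "No data")

# For each column, count the cells that match a placeholder string.
placeholder_counts <- sapply(toy, function(col) sum(col %in% placeholders))
placeholder_counts
```

The result is a named vector with one count per column, which makes it easy to spot the columns that need cleaning.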
<h3>
R Fundamentals - Process missing data
</h3>
Count missing values in dataframe:
sum(is.na(ASD_National)) # No missing data recognised by R (NA)
## [1] 0
sum(is.na(ASD_State)) # Some missing data recognised by R (NA)
## [1] 14454
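A single total says little about where the missing values sit. As a sketch on a toy dataframe (values invented for illustration), `colSums()` over `is.na()` gives a per-column count, and `complete.cases()` flags rows with no `NA` at all:

```r
toy <- data.frame(State = c("UT", "VT", "VA"),
                  Male.Prevalence = c(NA, NA, 9.7),
                  Female.Prevalence = c(NA, 3.2, 2.0),
                  stringsAsFactors = FALSE)

total_na      <- sum(is.na(toy))           # all missing cells
na_per_column <- colSums(is.na(toy))       # missing cells per column
complete_rows <- sum(complete.cases(toy))  # rows without any NA
```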
Empty strings and “No data” are not treated as missing values by R, so we need to handle them manually.
# Define several offending strings
na_strings <- c("", "No data", "NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")
# Load required function from packages:
if(!require(naniar)){install.packages("naniar")}
## Loading required package: naniar
library(naniar)
if(!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dplyr)
# Uncomment below to show help
# ?replace_with_na_all # Documentation
Replace these defined missing/offending values with R’s internal NA
# "~.x" is a formula-style anonymous function: .x stands for each value being tested
ASD_National = replace_with_na_all(ASD_National, condition = ~.x %in% na_strings)
# Count missing values (R's internal NA) in dataframe:
sum(is.na(ASD_National))
## [1] 650
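The same replacement can be done without extra packages. The following is a base R sketch on a toy dataframe (shortened `na_strings` and invented values, for illustration only); unlike `replace_with_na_all()`, it leaves the object a plain `data.frame`:

```r
na_strings <- c("", "No data", "NA", "N/A")

toy <- data.frame(Year = c(2000L, 2002L, 2004L),
                  Male.Prevalence = c("No data", "11.5", ""),
                  stringsAsFactors = FALSE)

# toy[] keeps the dataframe structure while every column is replaced.
toy[] <- lapply(toy, function(col) {
  col[col %in% na_strings] <- NA   # overwrite matching cells with NA
  col
})

sum(is.na(toy))
```

Columns without any match, such as `Year` here, keep their original type.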
<h3>
R Fundamentals - Process invalid characters
</h3>
Replace the invalid character \x92 (byte 0x92, the Windows-1252 right single quotation mark) with a plain apostrophe:
ASD_National$Source_Full1[ASD_National$Source_Full1 == "National Survey of Children\x92s Health"] <-
"National Survey of Children's Health"
ASD_National$Source_Full2[ASD_National$Source_Full2 == "nsch-National Survey of Children\x92s Health"] <-
"nsch-National Survey of Children's Health"
<h3>
R Fundamentals - Delete/Drop dataframe variable
</h3>
Delete/Drop duplicate variable: Prevalence_dup
drop <- c("Prevalence_dup", "Dummy Variable Name")
ASD_National = ASD_National[, !(names(ASD_National) %in% drop)] # Recall Dataframe[rows,columns]
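Two other common ways to drop columns are sketched below on a toy dataframe (`toy` and `kept` are illustrative names). `setdiff()` keeps every column name not listed in `drop`, and assigning `NULL` removes a single column in place:

```r
toy <- data.frame(Source = "addm", Prevalence = 6.7, Prevalence_dup = 6.7,
                  stringsAsFactors = FALSE)
drop <- c("Prevalence_dup", "Dummy Variable Name")

# Keep all columns whose names are not in `drop`; names that do not
# exist in the dataframe are silently ignored.
kept <- toy[, setdiff(names(toy), drop), drop = FALSE]

# Remove one column in place by assigning NULL.
toy$Prevalence_dup <- NULL
```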
<h3>
R Fundamentals - Create/Add dataframe variable
</h3>
Create one new variable, Source_UC, by converting Source to uppercase letters
ASD_National$Source_UC <- toupper(ASD_National$Source)
Create one new variable: Source_Full3 by combining Source and Source_Full1
ASD_National$Source_Full3 <- paste(toupper(ASD_National$Source), ASD_National$Source_Full1)
Create one new ordinal categorical variable: Prevalence_Risk2 (“Low”, “High”) by binning Prevalence
# Recode Prevalence into risk categories
# Low [0, 5)
# High [5, +oo)
ASD_National$Prevalence_Risk2[ASD_National$Prevalence < 5] = "Low"
## Warning: Unknown or uninitialised column: 'Prevalence_Risk2'.
ASD_National$Prevalence_Risk2[ASD_National$Prevalence >= 5 ] = "High"
#
head(ASD_National)
## # A tibble: 6 x 28
## Source Year Prevalence Upper.CI Lower.CI Source_Full1 Source_Full2
## <chr> <int> <dbl> <dbl> <dbl> <chr> <chr>
## 1 addm 2000 6.7 7 6.3 Autism & De… addm-Autism…
## 2 addm 2002 6.6 6.8 6.3 Autism & De… addm-Autism…
## 3 addm 2004 8 8.4 7.6 Autism & De… addm-Autism…
## 4 addm 2006 9 9.3 8.6 Autism & De… addm-Autism…
## 5 addm 2008 11.3 11.7 11 Autism & De… addm-Autism…
## 6 addm 2010 14.7 15.1 14.3 Autism & De… addm-Autism…
## # … with 21 more variables: Male.Prevalence <chr>, Male.Lower.CI <chr>,
## # Male.Upper.CI <chr>, Female.Prevalence <chr>, Female.Lower.CI <chr>,
## # Female.Upper.CI <chr>, Non.hispanic.white.Prevalence <chr>,
## # Non.hispanic.white.Lower.CI <chr>, Non.hispanic.white.Upper.CI <chr>,
## # Non.hispanic.black.Prevalence <chr>, Non.hispanic.black.Lower.CI <chr>,
## # Non.hispanic.black.Upper.CI <chr>, Hispanic.Prevalence <chr>,
## # Hispanic.Lower.CI <chr>, Hispanic.Upper.CI <chr>,
## # Asian.or.Pacific.Islander.Prevalence <chr>,
## # Asian.or.Pacific.Islander.Lower.CI <chr>,
## # Asian.or.Pacific.Islander.Upper.CI <chr>, Source_UC <chr>,
## # Source_Full3 <chr>, Prevalence_Risk2 <chr>
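The two-level recode can also be written as a single vectorised assignment with `ifelse()`. Because the whole column is created in one step, nothing is read from a column that does not exist yet. This is a sketch on a toy dataframe (values chosen for illustration), not a replacement for the code above:

```r
toy <- data.frame(Prevalence = c(2.3, 5.0, 6.7, 16.8, NA))

# Low  [0, 5)
# High [5, +oo)
toy$Prevalence_Risk2 <- ifelse(toy$Prevalence < 5, "Low", "High")
toy
```

A missing `Prevalence` yields a missing risk label, since the comparison itself is `NA`.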
Create one new ordinal categorical variable: Prevalence_Risk4 (“Low”, “Medium”, “High”, “Very High”) by binning Prevalence
# Recode Prevalence into risk categories
# Low [0, 5)
# Medium [5, 10)
# High [10, 20)
# Very High [20, +oo)
ASD_National$Prevalence_Risk4 = "Very High"
ASD_National$Prevalence_Risk4[ASD_National$Prevalence < 20 ] = "High"
ASD_National$Prevalence_Risk4[ASD_National$Prevalence < 10 ] = "Medium"
ASD_National$Prevalence_Risk4[ASD_National$Prevalence < 5] = "Low"
#
head(ASD_National)
## # A tibble: 6 x 29
## Source Year Prevalence Upper.CI Lower.CI Source_Full1 Source_Full2
## <chr> <int> <dbl> <dbl> <dbl> <chr> <chr>
## 1 addm 2000 6.7 7 6.3 Autism & De… addm-Autism…
## 2 addm 2002 6.6 6.8 6.3 Autism & De… addm-Autism…
## 3 addm 2004 8 8.4 7.6 Autism & De… addm-Autism…
## 4 addm 2006 9 9.3 8.6 Autism & De… addm-Autism…
## 5 addm 2008 11.3 11.7 11 Autism & De… addm-Autism…
## 6 addm 2010 14.7 15.1 14.3 Autism & De… addm-Autism…
## # … with 22 more variables: Male.Prevalence <chr>, Male.Lower.CI <chr>,
## # Male.Upper.CI <chr>, Female.Prevalence <chr>, Female.Lower.CI <chr>,
## # Female.Upper.CI <chr>, Non.hispanic.white.Prevalence <chr>,
## # Non.hispanic.white.Lower.CI <chr>, Non.hispanic.white.Upper.CI <chr>,
## # Non.hispanic.black.Prevalence <chr>, Non.hispanic.black.Lower.CI <chr>,
## # Non.hispanic.black.Upper.CI <chr>, Hispanic.Prevalence <chr>,
## # Hispanic.Lower.CI <chr>, Hispanic.Upper.CI <chr>,
## # Asian.or.Pacific.Islander.Prevalence <chr>,
## # Asian.or.Pacific.Islander.Lower.CI <chr>,
## # Asian.or.Pacific.Islander.Upper.CI <chr>, Source_UC <chr>,
## # Source_Full3 <chr>, Prevalence_Risk2 <chr>, Prevalence_Risk4 <chr>
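Binning into several intervals is what base R's `cut()` is designed for. The sketch below applies the same four bins to a toy vector (values chosen to sit on and around the boundaries). It returns an ordered factor rather than the character column created above, so it is an alternative, not the method used here:

```r
prevalence <- c(2.3, 5, 9.9, 10, 16.8, 20, 26.6)

risk4 <- cut(prevalence,
             breaks = c(0, 5, 10, 20, Inf),
             labels = c("Low", "Medium", "High", "Very High"),
             right = FALSE,          # intervals are closed on the left: [a, b)
             ordered_result = TRUE)  # Low < Medium < High < Very High

table(risk4)
```

With `right = FALSE` the boundary values 5, 10 and 20 fall into the higher bin, matching the interval notation in the comments above. A missing input value stays `NA`.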
<h3>
R Fundamentals - Convert to correct data types
</h3>
Review data structure and variable names:
str(ASD_National)
## Classes 'tbl_df', 'tbl' and 'data.frame': 42 obs. of 29 variables:
## $ Source : chr "addm" "addm" "addm" "addm" ...
## $ Year : int 2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 ...
## $ Prevalence : num 6.7 6.6 8 9 11.3 14.7 14.8 16.8 9.5 16.2 ...
## $ Upper.CI : num 7 6.8 8.4 9.3 11.7 15.1 15.2 17.3 12 18.1 ...
## $ Lower.CI : num 6.3 6.3 7.6 8.6 11 14.3 14.4 16.4 7.4 14.5 ...
## $ Source_Full1 : chr "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
## $ Source_Full2 : chr "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" "addm-Autism & Developmental Disabilities Monitoring Network" ...
## $ Male.Prevalence : chr NA "11.5" "12.9" "14.5" ...
## $ Male.Lower.CI : chr NA NA "12.2" "13.9" ...
## $ Male.Upper.CI : chr NA NA "13.7" "15.1" ...
## $ Female.Prevalence : chr NA "2.7" "2.9" "3.2" ...
## $ Female.Lower.CI : chr NA NA "2.6" "2.9" ...
## $ Female.Upper.CI : chr NA NA "3.3" "3.5" ...
## $ Non.hispanic.white.Prevalence : chr NA "7.7" "9.7" "9.9" ...
## $ Non.hispanic.white.Lower.CI : chr NA NA "9.1" "9.4" ...
## $ Non.hispanic.white.Upper.CI : chr NA NA "10.4" "10.4" ...
## $ Non.hispanic.black.Prevalence : chr NA "6.5" "6.9" "7.2" ...
## $ Non.hispanic.black.Lower.CI : chr NA NA "6.2" "6.6" ...
## $ Non.hispanic.black.Upper.CI : chr NA NA "7.6" "7.8" ...
## $ Hispanic.Prevalence : chr NA NA "6.2" "5.9" ...
## $ Hispanic.Lower.CI : chr NA NA "5" "5.3" ...
## $ Hispanic.Upper.CI : chr NA NA "7.5" "6.6" ...
## $ Asian.or.Pacific.Islander.Prevalence: chr NA NA NA NA ...
## $ Asian.or.Pacific.Islander.Lower.CI : chr NA NA NA NA ...
## $ Asian.or.Pacific.Islander.Upper.CI : chr NA NA NA NA ...
## $ Source_UC : chr "ADDM" "ADDM" "ADDM" "ADDM" ...
## $ Source_Full3 : chr "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" ...
## $ Prevalence_Risk2 : chr "High" "High" "High" "High" ...
## $ Prevalence_Risk4 : chr "Medium" "Medium" "Medium" "Medium" ...
cbind(names(ASD_National), c(1:length(names(ASD_National))))
## [,1] [,2]
## [1,] "Source" "1"
## [2,] "Year" "2"
## [3,] "Prevalence" "3"
## [4,] "Upper.CI" "4"
## [5,] "Lower.CI" "5"
## [6,] "Source_Full1" "6"
## [7,] "Source_Full2" "7"
## [8,] "Male.Prevalence" "8"
## [9,] "Male.Lower.CI" "9"
## [10,] "Male.Upper.CI" "10"
## [11,] "Female.Prevalence" "11"
## [12,] "Female.Lower.CI" "12"
## [13,] "Female.Upper.CI" "13"
## [14,] "Non.hispanic.white.Prevalence" "14"
## [15,] "Non.hispanic.white.Lower.CI" "15"
## [16,] "Non.hispanic.white.Upper.CI" "16"
## [17,] "Non.hispanic.black.Prevalence" "17"
## [18,] "Non.hispanic.black.Lower.CI" "18"
## [19,] "Non.hispanic.black.Upper.CI" "19"
## [20,] "Hispanic.Prevalence" "20"
## [21,] "Hispanic.Lower.CI" "21"
## [22,] "Hispanic.Upper.CI" "22"
## [23,] "Asian.or.Pacific.Islander.Prevalence" "23"
## [24,] "Asian.or.Pacific.Islander.Lower.CI" "24"
## [25,] "Asian.or.Pacific.Islander.Upper.CI" "25"
## [26,] "Source_UC" "26"
## [27,] "Source_Full3" "27"
## [28,] "Prevalence_Risk2" "28"
## [29,] "Prevalence_Risk4" "29"
Convert the Prevalence and CI columns (columns 8 to 25) from character (chr) to numeric
ix <- 8:25 # define an index
# apply()
ASD_National[ix] <- apply(ASD_National[ix], 2, as.numeric) # "2" means column-wise; "1" means row-wise.
# Uncomment below to show help
# ?apply # Documentation
# or lapply()
ASD_National[ix] <- lapply(ASD_National[ix], as.numeric) # column-wise
# Uncomment below to show help
# ?lapply # Documentation
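The two calls above give the same converted columns; they differ in what they return before the assignment. A minimal sketch on a toy data frame (the name `toy` and its values are hypothetical, not part of the dataset):

```r
# Toy data frame (hypothetical) with numbers stored as character strings
toy <- data.frame(a = c("1.5", "2.5", NA),
                  b = c("10", NA, "30"),
                  stringsAsFactors = FALSE)

# apply() coerces the data frame to a matrix first and returns a matrix
m <- apply(toy, 2, as.numeric)
class(m)

# lapply() works column by column and returns a list
l <- lapply(toy, as.numeric)
class(l)

# Assigning the list back into toy[] keeps the data frame structure
toy[] <- lapply(toy, as.numeric)
str(toy)
```

Because `lapply()` never builds an intermediate matrix, it is usually the safer choice when the selected columns have mixed types.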
Convert the Source columns from character (chr) to factor
ix <- c(1, 6, 7, 26, 27) # define an index
ASD_National[ix] <- lapply(ASD_National[ix], as.factor)
Create new ordered factor Year_Factor from Year
ASD_National$Year_Factor <- factor(ASD_National$Year, ordered = TRUE)
# Observe the difference in 'Levels' between the two factors below
ASD_National$Year_Factor # Ordinal categorical variable
## [1] 2000 2002 2004 2006 2008 2010 2012 2014 2004 2008 2012 2016 2000 2001 2002
## [16] 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2000
## [31] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 17 Levels: 2000 < 2001 < 2002 < 2003 < 2004 < 2005 < 2006 < 2007 < ... < 2016
str(ASD_National$Year_Factor)
## Ord.factor w/ 17 levels "2000"<"2001"<..: 1 3 5 7 9 11 13 15 5 9 ...
ASD_National$Source # Nominal categorical variable
## [1] addm addm addm addm addm addm addm addm nsch nsch nsch nsch sped sped sped
## [16] sped sped sped sped sped sped sped sped sped sped sped sped sped sped medi
## [31] medi medi medi medi medi medi medi medi medi medi medi medi
## Levels: addm medi nsch sped
str(ASD_National$Source)
## Factor w/ 4 levels "addm","medi",..: 1 1 1 1 1 1 1 1 3 3 ...
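The practical difference between the two kinds of factor shows up in comparisons. A small sketch, using four of the years shown above (the variable names `year_ord` and `year_nom` are made up for illustration):

```r
# Years taken from the ADDM rows shown above
year <- c(2000, 2002, 2004, 2006)

year_ord <- factor(year, ordered = TRUE)   # ordinal: levels have an order
year_nom <- factor(year)                   # nominal: levels are just labels

# Comparisons are meaningful for an ordered factor
later <- year_ord > "2002"
later

# For an unordered factor the same comparison yields NA with a warning
nominal_cmp <- suppressWarnings(year_nom > "2002")
nominal_cmp

# as.integer() returns the level codes, not the years themselves
as.integer(year_ord)
as.numeric(as.character(year_ord))  # recovers the original years
```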
Convert Prevalence_Risk2 & Prevalence_Risk4 to ordered factors
# Convert to factor
ASD_National$Prevalence_Risk2 = factor(ASD_National$Prevalence_Risk2, ordered=TRUE,
levels=c("Low", "High"))
# Convert to factor
ASD_National$Prevalence_Risk4 = factor(ASD_National$Prevalence_Risk4, ordered=TRUE,
levels=c("Low", "Medium", "High", "Very High"))
# Optionally, below are manual conversion examples:
# ASD_National$Male.Prevalence = as.numeric(ASD_National$Male.Prevalence)
# ASD_National$Source = as.factor(ASD_National$Source)
# ASD_National$Prevalence_Risk2 = factor(ASD_National$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_National$Prevalence_Risk4 = factor(ASD_National$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
Optionally, export the processed data frame to a CSV file.
write.csv(ASD_National, file = "../dataset/ADV_ASD_National_R.csv", row.names = FALSE)
# Read back in above saved file:
# ASD_National <- read.csv("../dataset/ADV_ASD_National_R.csv")
# ASD_National$Year_Factor <- factor(ASD_National$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
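A CSV file stores only text, so the ordering of an ordered factor does not survive a round trip. The sketch below illustrates this with a hypothetical three-row data frame and a temporary file (neither is part of the tutorial's dataset folder):

```r
# Toy data frame (hypothetical subset) with an ordered factor
df <- data.frame(Year = c(2000, 2002, 2004))
df$Year_Factor <- factor(df$Year, ordered = TRUE)

# Write to a temporary file rather than the tutorial's dataset folder
tmp <- tempfile(fileext = ".csv")
write.csv(df, file = tmp, row.names = FALSE)

# Reading back: the ordering information is gone
df2 <- read.csv(tmp)
was_ordered <- is.ordered(df2$Year_Factor)
was_ordered

# Restore the ordered factor explicitly
df2$Year_Factor <- factor(df2$Year_Factor, ordered = TRUE)
is.ordered(df2$Year_Factor)

unlink(tmp)  # clean up the temporary file
```

If the exact column types matter, `saveRDS()` and `readRDS()` keep them intact, at the cost of a file that is not human-readable.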
<h3>
Data Summarization - High Level Data Summary
</h3>
summary(ASD_National)
## Source Year Prevalence Upper.CI Lower.CI
## addm: 8 Min. :2000 Min. : 1.800 Min. : 1.800 Min. : 1.700
## medi:13 1st Qu.:2004 1st Qu.: 3.950 1st Qu.: 3.950 1st Qu.: 3.875
## nsch: 4 Median :2008 Median : 6.650 Median : 6.900 Median : 6.350
## sped:17 Mean :2007 Mean : 7.952 Mean : 8.207 Mean : 7.712
## 3rd Qu.:2011 3rd Qu.: 9.725 3rd Qu.:10.350 3rd Qu.: 9.625
## Max. :2016 Max. :29.200 Max. :30.700 Max. :27.700
##
## Source_Full1
## Autism & Developmental Disabilities Monitoring Network: 8
## Medicaid :13
## National Survey of Children's Health : 4
## Special Education Child Count :17
##
##
##
## Source_Full2
## addm-Autism & Developmental Disabilities Monitoring Network: 8
## medi-Medicaid :13
## nsch-National Survey of Children's Health : 4
## sped-Special Education Child Count :17
##
##
##
## Male.Prevalence Male.Lower.CI Male.Upper.CI Female.Prevalence
## Min. :11.50 Min. :12.20 Min. :13.70 Min. :2.700
## 1st Qu.:13.70 1st Qu.:14.85 1st Qu.:16.07 1st Qu.:3.050
## Median :18.40 Median :20.20 Median :21.55 Median :4.000
## Mean :18.71 Mean :19.22 Mean :20.62 Mean :4.271
## 3rd Qu.:23.55 3rd Qu.:22.93 3rd Qu.:24.32 3rd Qu.:5.250
## Max. :26.60 Max. :25.80 Max. :27.40 Max. :6.600
## NA's :35 NA's :36 NA's :36 NA's :35
## Female.Lower.CI Female.Upper.CI Non.hispanic.white.Prevalence
## Min. :2.600 Min. :3.300 Min. : 7.70
## 1st Qu.:3.100 1st Qu.:3.700 1st Qu.: 9.80
## Median :4.300 Median :4.950 Median :12.00
## Mean :4.217 Mean :4.900 Mean :12.51
## 3rd Qu.:4.975 3rd Qu.:5.675 3rd Qu.:15.55
## Max. :6.200 Max. :7.000 Max. :17.20
## NA's :36 NA's :36 NA's :35
## Non.hispanic.white.Lower.CI Non.hispanic.white.Upper.CI
## Min. : 9.100 Min. :10.40
## 1st Qu.: 9.925 1st Qu.:10.93
## Median :13.100 Median :14.20
## Mean :12.733 Mean :13.88
## 3rd Qu.:15.075 3rd Qu.:16.20
## Max. :16.500 Max. :17.80
## NA's :36 NA's :36
## Non.hispanic.black.Prevalence Non.hispanic.black.Lower.CI
## Min. : 6.50 Min. : 6.200
## 1st Qu.: 7.05 1st Qu.: 7.325
## Median :10.20 Median :10.500
## Mean :10.31 Mean :10.200
## 3rd Qu.:12.70 3rd Qu.:12.100
## Max. :16.00 Max. :15.100
## NA's :35 NA's :36
## Non.hispanic.black.Upper.CI Hispanic.Prevalence Hispanic.Lower.CI
## Min. : 7.600 Min. : 5.900 Min. : 5.000
## 1st Qu.: 8.575 1st Qu.: 6.625 1st Qu.: 5.775
## Median :12.000 Median : 9.000 Median : 8.300
## Mean :11.700 Mean : 9.150 Mean : 8.333
## 3rd Qu.:13.700 3rd Qu.:10.625 3rd Qu.: 9.850
## Max. :16.900 Max. :14.000 Max. :13.100
## NA's :36 NA's :36 NA's :36
## Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## Min. : 6.600 Min. : 9.70
## 1st Qu.: 7.775 1st Qu.:10.97
## Median : 9.750 Median :11.85
## Mean :10.017 Mean :11.72
## 3rd Qu.:11.425 3rd Qu.:12.60
## Max. :14.900 Max. :13.50
## NA's :36 NA's :38
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## Min. : 8.10 Min. :11.60
## 1st Qu.: 9.45 1st Qu.:12.72
## Median :10.30 Median :13.65
## Mean :10.12 Mean :13.57
## 3rd Qu.:10.97 3rd Qu.:14.50
## Max. :11.80 Max. :15.40
## NA's :38 NA's :38
## Source_UC Source_Full3
## ADDM: 8 ADDM Autism & Developmental Disabilities Monitoring Network: 8
## MEDI:13 MEDI Medicaid :13
## NSCH: 4 NSCH National Survey of Children's Health : 4
## SPED:17 SPED Special Education Child Count :17
##
##
##
## Prevalence_Risk2 Prevalence_Risk4 Year_Factor
## Low :14 Low :14 2004 : 4
## High:28 Medium :18 2008 : 4
## High : 8 2012 : 4
## Very High: 2 2000 : 3
## 2002 : 3
## 2006 : 3
## (Other):21
<h3>
Data Summarization - Summary of <span style="color:blue">numeric</span> variables
</h3>
# Filter only numeric variables/columns
select_if(ASD_National, is.numeric) # library(dplyr)
## # A tibble: 42 x 22
## Year Prevalence Upper.CI Lower.CI Male.Prevalence Male.Lower.CI
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000 6.7 7 6.3 NA NA
## 2 2002 6.6 6.8 6.3 11.5 NA
## 3 2004 8 8.4 7.6 12.9 12.2
## 4 2006 9 9.3 8.6 14.5 13.9
## 5 2008 11.3 11.7 11 18.4 17.7
## 6 2010 14.7 15.1 14.3 23.7 23
## 7 2012 14.8 15.2 14.4 23.4 22.7
## 8 2014 16.8 17.3 16.4 26.6 25.8
## 9 2004 9.5 12 7.4 NA NA
## 10 2008 16.2 18.1 14.5 NA NA
## # … with 32 more rows, and 16 more variables: Male.Upper.CI <dbl>,
## # Female.Prevalence <dbl>, Female.Lower.CI <dbl>, Female.Upper.CI <dbl>,
## # Non.hispanic.white.Prevalence <dbl>, Non.hispanic.white.Lower.CI <dbl>,
## # Non.hispanic.white.Upper.CI <dbl>, Non.hispanic.black.Prevalence <dbl>,
## # Non.hispanic.black.Lower.CI <dbl>, Non.hispanic.black.Upper.CI <dbl>,
## # Hispanic.Prevalence <dbl>, Hispanic.Lower.CI <dbl>,
## # Hispanic.Upper.CI <dbl>, Asian.or.Pacific.Islander.Prevalence <dbl>,
## # Asian.or.Pacific.Islander.Lower.CI <dbl>,
## # Asian.or.Pacific.Islander.Upper.CI <dbl>
# Data summarization
summary(select_if(ASD_National, is.numeric))
## Year Prevalence Upper.CI Lower.CI
## Min. :2000 Min. : 1.800 Min. : 1.800 Min. : 1.700
## 1st Qu.:2004 1st Qu.: 3.950 1st Qu.: 3.950 1st Qu.: 3.875
## Median :2008 Median : 6.650 Median : 6.900 Median : 6.350
## Mean :2007 Mean : 7.952 Mean : 8.207 Mean : 7.712
## 3rd Qu.:2011 3rd Qu.: 9.725 3rd Qu.:10.350 3rd Qu.: 9.625
## Max. :2016 Max. :29.200 Max. :30.700 Max. :27.700
##
## Male.Prevalence Male.Lower.CI Male.Upper.CI Female.Prevalence
## Min. :11.50 Min. :12.20 Min. :13.70 Min. :2.700
## 1st Qu.:13.70 1st Qu.:14.85 1st Qu.:16.07 1st Qu.:3.050
## Median :18.40 Median :20.20 Median :21.55 Median :4.000
## Mean :18.71 Mean :19.22 Mean :20.62 Mean :4.271
## 3rd Qu.:23.55 3rd Qu.:22.93 3rd Qu.:24.32 3rd Qu.:5.250
## Max. :26.60 Max. :25.80 Max. :27.40 Max. :6.600
## NA's :35 NA's :36 NA's :36 NA's :35
## Female.Lower.CI Female.Upper.CI Non.hispanic.white.Prevalence
## Min. :2.600 Min. :3.300 Min. : 7.70
## 1st Qu.:3.100 1st Qu.:3.700 1st Qu.: 9.80
## Median :4.300 Median :4.950 Median :12.00
## Mean :4.217 Mean :4.900 Mean :12.51
## 3rd Qu.:4.975 3rd Qu.:5.675 3rd Qu.:15.55
## Max. :6.200 Max. :7.000 Max. :17.20
## NA's :36 NA's :36 NA's :35
## Non.hispanic.white.Lower.CI Non.hispanic.white.Upper.CI
## Min. : 9.100 Min. :10.40
## 1st Qu.: 9.925 1st Qu.:10.93
## Median :13.100 Median :14.20
## Mean :12.733 Mean :13.88
## 3rd Qu.:15.075 3rd Qu.:16.20
## Max. :16.500 Max. :17.80
## NA's :36 NA's :36
## Non.hispanic.black.Prevalence Non.hispanic.black.Lower.CI
## Min. : 6.50 Min. : 6.200
## 1st Qu.: 7.05 1st Qu.: 7.325
## Median :10.20 Median :10.500
## Mean :10.31 Mean :10.200
## 3rd Qu.:12.70 3rd Qu.:12.100
## Max. :16.00 Max. :15.100
## NA's :35 NA's :36
## Non.hispanic.black.Upper.CI Hispanic.Prevalence Hispanic.Lower.CI
## Min. : 7.600 Min. : 5.900 Min. : 5.000
## 1st Qu.: 8.575 1st Qu.: 6.625 1st Qu.: 5.775
## Median :12.000 Median : 9.000 Median : 8.300
## Mean :11.700 Mean : 9.150 Mean : 8.333
## 3rd Qu.:13.700 3rd Qu.:10.625 3rd Qu.: 9.850
## Max. :16.900 Max. :14.000 Max. :13.100
## NA's :36 NA's :36 NA's :36
## Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## Min. : 6.600 Min. : 9.70
## 1st Qu.: 7.775 1st Qu.:10.97
## Median : 9.750 Median :11.85
## Mean :10.017 Mean :11.72
## 3rd Qu.:11.425 3rd Qu.:12.60
## Max. :14.900 Max. :13.50
## NA's :36 NA's :38
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## Min. : 8.10 Min. :11.60
## 1st Qu.: 9.45 1st Qu.:12.72
## Median :10.30 Median :13.65
## Mean :10.12 Mean :13.57
## 3rd Qu.:10.97 3rd Qu.:14.50
## Max. :11.80 Max. :15.40
## NA's :38 NA's :38
[ Tips ] Several Prevalence variables contain missing data (see the NA's rows in the summary above).
# Calculate average Prevalence, no error
mean(ASD_National$Prevalence)
## [1] 7.952381
mean(ASD_National$Prevalence[ASD_National$Source == 'addm'])
## [1] 10.9875
mean(ASD_National$Prevalence[ASD_National$Source == 'medi'])
## [1] 4.676923
mean(ASD_National$Prevalence[ASD_National$Source == 'nsch'])
## [1] 19.025
mean(ASD_National$Prevalence[ASD_National$Source == 'sped'])
## [1] 6.423529
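The per-source means above can also be computed in a single call. A minimal sketch on a subset reconstructed from the outputs shown earlier (only the ADDM and NSCH rows; the data frame name `sub` is made up for illustration):

```r
# Subset reconstructed from the outputs above: ADDM and NSCH prevalence values
sub <- data.frame(
  Source     = c(rep("addm", 8), rep("nsch", 4)),
  Prevalence = c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8,
                 9.5, 16.2, 21.2, 29.2)
)

# Mean Prevalence per Source in one call
means <- tapply(sub$Prevalence, sub$Source, mean)
means

# Same idea with a formula interface; returns a data frame
aggregate(Prevalence ~ Source, data = sub, FUN = mean)
```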
# Calculate average Male.Prevalence: the result is NA rather than a number
mean(ASD_National$Male.Prevalence)
## [1] NA
# mean() returns NA when any value is NA, so we set na.rm = TRUE to ignore NAs
mean(ASD_National$Male.Prevalence, na.rm = TRUE)
## [1] 18.71429
mean(ASD_National$Female.Prevalence, na.rm = TRUE)
## [1] 4.271429
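Before dropping NAs it helps to know how many there are. A short sketch using the Male.Prevalence values of the eight ADDM rows shown earlier (all other rows are NA for this column, which is why the mean matches the one above):

```r
# Male.Prevalence for the eight ADDM rows, as shown in the output above
male <- c(NA, 11.5, 12.9, 14.5, 18.4, 23.7, 23.4, 26.6)

sum(is.na(male))           # how many values are missing
mean(is.na(male))          # share of missing values
mean(male)                 # NA: any missing value propagates
mean(male, na.rm = TRUE)   # mean of the 7 observed values

# For a whole data frame, colSums(is.na(df)) counts NAs per column
df <- data.frame(male = male, year = seq(2000, 2014, by = 2))
na_counts <- colSums(is.na(df))
na_counts
```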
# Count occurrences of unique values in a variable/column: number of rows (data entries) per year
table(ASD_National$Year) # ?table
##
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 3 2 3 2 4 2 3 2 4 2 3 2 4 1 2 1
## 2016
## 2
<h3>
Data Summarization - Summary of <span style="color:blue">categorical</span> variables
</h3>
# List of categorical variables
names(select_if(ASD_National, is.factor)) # All categorical variables are factor data type
## [1] "Source" "Source_Full1" "Source_Full2" "Source_UC"
## [5] "Source_Full3" "Prevalence_Risk2" "Prevalence_Risk4" "Year_Factor"
names(select_if(ASD_National, is.character)) # No categorical variable is character data type
## character(0)
# Look at summary
summary(select_if(ASD_National, is.factor))
## Source Source_Full1
## addm: 8 Autism & Developmental Disabilities Monitoring Network: 8
## medi:13 Medicaid :13
## nsch: 4 National Survey of Children's Health : 4
## sped:17 Special Education Child Count :17
##
##
##
## Source_Full2 Source_UC
## addm-Autism & Developmental Disabilities Monitoring Network: 8 ADDM: 8
## medi-Medicaid :13 MEDI:13
## nsch-National Survey of Children's Health : 4 NSCH: 4
## sped-Special Education Child Count :17 SPED:17
##
##
##
## Source_Full3
## ADDM Autism & Developmental Disabilities Monitoring Network: 8
## MEDI Medicaid :13
## NSCH National Survey of Children's Health : 4
## SPED Special Education Child Count :17
##
##
##
## Prevalence_Risk2 Prevalence_Risk4 Year_Factor
## Low :14 Low :14 2004 : 4
## High:28 Medium :18 2008 : 4
## High : 8 2012 : 4
## Very High: 2 2000 : 3
## 2002 : 3
## 2006 : 3
## (Other):21
summary(select_if(ASD_National, is.character))
## < table of extent 0 x 0 >
# Count occurrences of unique values in a variable/column
table(ASD_National$Source)
##
## addm medi nsch sped
## 8 13 4 17
table(ASD_National$Source_Full3)
##
## ADDM Autism & Developmental Disabilities Monitoring Network
## 8
## MEDI Medicaid
## 13
## NSCH National Survey of Children's Health
## 4
## SPED Special Education Child Count
## 17
table(ASD_National$Year_Factor)
##
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 3 2 3 2 4 2 3 2 4 2 3 2 4 1 2 1
## 2016
## 2
table(ASD_National$Prevalence) # numeric is also possible
##
## 1.8 2.1 2.3 2.6 2.8 3 3.5 3.6 3.9 4.1 4.4 4.8 5.1 5.4 5.6 5.9
## 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## 6.2 6.4 6.6 6.7 7 7.1 7.7 8 8.2 8.4 9 9.1 9.5 9.8 10.5 11.2
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 11.3 11.9 14.7 14.8 16.2 16.8 21.2 29.2
## 1 1 1 1 1 1 1 1
# Display unique values (levels) of the factor (categorical) variables
lapply(select_if(ASD_National, is.factor), levels)
## $Source
## [1] "addm" "medi" "nsch" "sped"
##
## $Source_Full1
## [1] "Autism & Developmental Disabilities Monitoring Network"
## [2] "Medicaid"
## [3] "National Survey of Children's Health"
## [4] "Special Education Child Count"
##
## $Source_Full2
## [1] "addm-Autism & Developmental Disabilities Monitoring Network"
## [2] "medi-Medicaid"
## [3] "nsch-National Survey of Children's Health"
## [4] "sped-Special Education Child Count"
##
## $Source_UC
## [1] "ADDM" "MEDI" "NSCH" "SPED"
##
## $Source_Full3
## [1] "ADDM Autism & Developmental Disabilities Monitoring Network"
## [2] "MEDI Medicaid"
## [3] "NSCH National Survey of Children's Health"
## [4] "SPED Special Education Child Count"
##
## $Prevalence_Risk2
## [1] "Low" "High"
##
## $Prevalence_Risk4
## [1] "Low" "Medium" "High" "Very High"
##
## $Year_Factor
## [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016"
# or using variable names
lapply(ASD_National[c('Source_UC', 'Year_Factor')], levels)
## $Source_UC
## [1] "ADDM" "MEDI" "NSCH" "SPED"
##
## $Year_Factor
## [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016"
# Pivot (cross-tabulation) of occurrence counts
table(ASD_National$Source_Full3, ASD_National$Year) # table(ASD_National$Year, ASD_National$Source_Full3)
##
## 2000 2001 2002
## ADDM Autism & Developmental Disabilities Monitoring Network 1 0 1
## MEDI Medicaid 1 1 1
## NSCH National Survey of Children's Health 0 0 0
## SPED Special Education Child Count 1 1 1
##
## 2003 2004 2005
## ADDM Autism & Developmental Disabilities Monitoring Network 0 1 0
## MEDI Medicaid 1 1 1
## NSCH National Survey of Children's Health 0 1 0
## SPED Special Education Child Count 1 1 1
##
## 2006 2007 2008
## ADDM Autism & Developmental Disabilities Monitoring Network 1 0 1
## MEDI Medicaid 1 1 1
## NSCH National Survey of Children's Health 0 0 1
## SPED Special Education Child Count 1 1 1
##
## 2009 2010 2011
## ADDM Autism & Developmental Disabilities Monitoring Network 0 1 0
## MEDI Medicaid 1 1 1
## NSCH National Survey of Children's Health 0 0 0
## SPED Special Education Child Count 1 1 1
##
## 2012 2013 2014
## ADDM Autism & Developmental Disabilities Monitoring Network 1 0 1
## MEDI Medicaid 1 0 0
## NSCH National Survey of Children's Health 1 0 0
## SPED Special Education Child Count 1 1 1
##
## 2015 2016
## ADDM Autism & Developmental Disabilities Monitoring Network 0 0
## MEDI Medicaid 0 0
## NSCH National Survey of Children's Health 0 1
## SPED Special Education Child Count 1 1
# Pivot (cross-tabulation) of occurrence counts
table(ASD_National$Prevalence_Risk2, ASD_National$Source)
##
## addm medi nsch sped
## Low 0 7 0 7
## High 8 6 4 10
# Pivot (cross-tabulation) of occurrence counts
table(ASD_National$Prevalence_Risk4, ASD_National$Source)
##
## addm medi nsch sped
## Low 0 7 0 7
## Medium 4 6 1 7
## High 4 0 1 3
## Very High 0 0 2 0
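Raw counts are hard to compare when the sources have different numbers of records. As a sketch, the counts from the Prevalence_Risk2 by Source table above can be turned into totals and within-source proportions (the matrix is typed in by hand here so the example runs on its own):

```r
# Counts copied from the Prevalence_Risk2 by Source table above
counts <- matrix(c(0, 8, 7, 6, 0, 4, 7, 10), nrow = 2,
                 dimnames = list(Risk = c("Low", "High"),
                                 Source = c("addm", "medi", "nsch", "sped")))
counts <- as.table(counts)

addmargins(counts)                    # add row and column totals
col_share <- prop.table(counts, 2)    # proportions within each Source (columns sum to 1)
round(col_share, 2)
```

With the full data frame, the same calls work directly on `table(ASD_National$Prevalence_Risk2, ASD_National$Source)`.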
# library(repr)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
<h3>
Data Visualisation (Base Graphic) - Histogram (distribution of binned continuous variable)
</h3>
https://www.statmethods.net/graphs/density.html
hist(ASD_National$Prevalence)
par(mfrow=c(1, 2)) # multiple plots on one page: 1 row, 2 columns
hist(ASD_National$Male.Prevalence)
hist(ASD_National$Female.Prevalence)
par(mfrow=c(1, 1)) # Reset to one plot on one page
# Histogram with annotations
hist(ASD_National$Prevalence,
main = "Frequency of National ASD Prevalence", # Chart title
xlab = "Prevalence per 1,000 Children", # x axis label
ylab = "Frequency or Occurrences",# y axis label
sub = "Year 2000 - 2016", # Chart subtitle at bottom
col.main="blue", col.lab="black", col.sub="darkgrey") # Colours
<h3>
Density plot (distribution for continuous variable normalized to 100% area under curve)
</h3>
https://www.statmethods.net/graphs/density.html
par(mfrow=c(1, 2)) # multiple plots on one page: 1 row, 2 columns
plot(density(ASD_National$Prevalence))
# Density plot with annotations
plot(density(ASD_National$Prevalence),
main = "Density of National ASD Prevalence",
xlab = "Prevalence per 1,000 Children",
ylab = "Density",
sub = "Year 2000 - 2016",
col.main="blue", col.lab="black", col.sub="darkgrey")
par(mfrow=c(1, 1))
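The phrase "normalized to 100% area under curve" can be checked numerically. A rough sketch using the eight ADDM prevalence values shown earlier (the trapezoidal sum is an approximation, so the area is only close to 1):

```r
x <- c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8)  # ADDM prevalence values from above

d <- density(x)        # Gaussian kernel, default bandwidth
# Trapezoidal approximation of the area under the estimated curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area                   # close to 1

h <- hist(x, plot = FALSE)
sum(h$counts)                      # histogram counts add up to the number of observations
sum(h$density * diff(h$breaks))    # histogram densities also integrate to 1
```

This is why the y axis of a density plot is a density rather than a count: only the area, not the height, has a direct interpretation.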
<h3>
Boxplot (median, 25% quantile, 75% quantile)
</h3>
https://www.statmethods.net/graphs/boxplot.html
https://stats.stackexchange.com/questions/156778/percentile-vs-quantile-vs-quartile
0th quartile = 0 quantile = 0th percentile (minimum)
1st quartile = 0.25 quantile = 25th percentile
2nd quartile = 0.5 quantile = 50th percentile (median)
3rd quartile = 0.75 quantile = 75th percentile
4th quartile = 1 quantile = 100th percentile (maximum)
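The correspondence listed above can be reproduced with `quantile()`. A minimal sketch, again using the eight ADDM prevalence values from the earlier output:

```r
# ADDM prevalence values from the output above
x <- c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8)

# quantile() takes probabilities between 0 and 1
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
q

# The 0.5 quantile is the median; the 0 and 1 quantiles are min and max
median(x)
range(x)

# Interquartile range = 3rd quartile - 1st quartile
IQR(x)
```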
par(mfrow=c(1, 2)) # multiple plots on one page: 1 row, 2 columns
# All children prevalence, with and without notches, side by side:
boxplot(ASD_National$Prevalence, notch = TRUE) # A notch is drawn on each side of the box, marking an approximate 95% confidence interval for the median. If the notches of two plots do not overlap, this is 'strong evidence' that the two medians differ.
boxplot(ASD_National$Prevalence) # All children
par(mfrow=c(1, 1))
par(mfrow=c(1, 2)) # multiple plots on one page: 1 row, 2 columns
# Male prevalence and Female prevalence side by side:
boxplot(ASD_National$Male.Prevalence, ylim = c(0, 35), notch = TRUE) # Male children
## Warning in bxp(list(stats = structure(c(11.5, 13.7, 18.4, 23.55, 26.6), .Dim =
## c(5L, : some notches went outside hinges ('box'): maybe set notch=FALSE
boxplot(ASD_National$Female.Prevalence, ylim = c(0, 35), notch = TRUE) # Female children
## Warning in bxp(list(stats = structure(c(2.7, 3.05, 4, 5.25, 6.6), .Dim = c(5L, :
## some notches went outside hinges ('box'): maybe set notch=FALSE
par(mfrow=c(1, 1))
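The "notches went outside hinges" warnings above come from the small number of non-missing values. A sketch of the notch calculation as documented for `boxplot.stats()`, using the ADDM Male.Prevalence values shown earlier:

```r
# Male.Prevalence (ADDM rows) from the output above; NA is dropped by boxplot.stats()
male <- c(NA, 11.5, 12.9, 14.5, 18.4, 23.7, 23.4, 26.6)

bs <- boxplot.stats(male)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$conf    # lower and upper ends of the notch

# The notch is median +/- 1.58 * IQR / sqrt(n), with IQR taken between the hinges
n      <- bs$n
hinges <- bs$stats[c(2, 4)]
manual <- bs$stats[3] + c(-1.58, 1.58) * diff(hinges) / sqrt(n)

# With only 7 observations the notch is wider than the box
notch_outside <- manual[1] < hinges[1] || manual[2] > hinges[2]
notch_outside
```

Since the notch width shrinks with the square root of n, the warning disappears for variables with more observations, such as the overall Prevalence.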
# Display value ranges
# numeric:
range(ASD_National$Prevalence)
## [1] 1.8 29.2
range(ASD_National$Year)
## [1] 2000 2016
# categorical:
min(ASD_National$Year_Factor)
## [1] 2000
## 17 Levels: 2000 < 2001 < 2002 < 2003 < 2004 < 2005 < 2006 < 2007 < ... < 2016
max(ASD_National$Year_Factor)
## [1] 2016
## 17 Levels: 2000 < 2001 < 2002 < 2003 < 2004 < 2005 < 2006 < 2007 < ... < 2016
# Create 'Prevalence' box plots broken down by 'Source'
boxplot(ASD_National$Prevalence ~ ASD_National$Source,
main = "National ASD Prevalence by Data Source",
xlab = "Data Source",
ylab = "Prevalence per 1,000 Children",
sub = "Year 2000 - 2016",
col.main="blue", col.lab="black", col.sub="darkgrey")
<h3>
Quiz:
</h3>
<p>
Set notch=TRUE in the above boxplot. Do the notches of the four data sources overlap?
</p>
# Write your code below and press Shift+Enter to execute
<h3>
Data Visualisation (Base Graphic) - Bar plot
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# [National] Risk by Data Source
# ----------------------------------
# Create bar chart using R graphics
counts = table(ASD_National$Prevalence_Risk2, ASD_National$Source)
#counts = table(ASD_National$Source, ASD_National$Prevalence_Risk4)
barplot(counts,
main="Prevalence by Data Sources and Risk Levels",
xlab="Data Sources", col=c("white", "lightgrey"),
ylab="Occurrences",
legend = rownames(counts),
args.legend = list(x="topleft", bty = "n", cex = 0.85, y.intersp=2))
# ----------------------------------
# [National] Risk by Data Source
# ----------------------------------
# Create bar chart using R graphics
counts = table(ASD_National$Prevalence_Risk2, ASD_National$Source) # Count of Risk records, split by Source
barplot(counts,
main="Prevalence by Data Sources and Risk Levels",
xlab="Data Sources",
ylab="Occurrences",
col=c("white", "lightgrey"),
legend = rownames(counts),
args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 2))
# ----------------------------------
# [National] Risk by Data Source
# ----------------------------------
# Create bar chart using R graphics
counts = table(ASD_National$Prevalence_Risk4, ASD_National$Source) # Count of Risk records, split by Source
barplot(counts,
main="Prevalence Occurrence by Source and Risk",
xlab="Data Sources",
ylab="Occurrences",
col=c("lightyellow", "orange", "red","darkred"),
legend = rownames(counts),
args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 2))
<h3>
Data Visualisation (Base Graphic) - Line chart
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=5)
# ----------------------------------
# [National] < Prevalence has changed over Time >
# ----------------------------------
# Prevalence over Year
# Use Year as x-axis: y value Prevalence is NOT aggregated for different data sources
plot(ASD_National$Year, ASD_National$Prevalence)
# Use Year_Factor as x-axis: plot() draws one boxplot per year, so Prevalence is summarised across data sources
plot(ASD_National$Year_Factor, ASD_National$Prevalence)
# table(ASD_National$Source_Full3)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
par(mfrow=c(2, 2))
# Prevalence over Year, from data source:
# addm-Autism & Developmental Disabilities Monitoring Network
plot(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Prevalence[ASD_National$Source == 'addm'])
# Prevalence over Year, from data source:
# medi-Medicaid
plot(ASD_National$Year[ASD_National$Source == 'medi'],
ASD_National$Prevalence[ASD_National$Source == 'medi'])
# Prevalence over Year, from data source:
# nsch-National Survey of Children's Health
plot(ASD_National$Year[ASD_National$Source == 'nsch'],
ASD_National$Prevalence[ASD_National$Source == 'nsch'])
# Prevalence over Year, from data source:
# sped-Special Education Child Count
plot(ASD_National$Year[ASD_National$Source == 'sped'],
ASD_National$Prevalence[ASD_National$Source == 'sped'])
par(mfrow=c(1, 1)) # Reset to one plot on one page
# ----------------------------------
# Add more annotations to above plots
# ----------------------------------
# Color list
# addm : darkblue
# medi : orange
# nsch : darkred
# sped : skyblue
par(mfrow=c(2, 2))
# Prevalence over Year, from data source:
# addm-Autism & Developmental Disabilities Monitoring Network
plot(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Prevalence[ASD_National$Source == 'addm'],
type="l", # plot type: "l" draws lines only
lty=1, # line type
lwd=3, # line width
col="darkblue", # line color
xlab="Year",
ylab="Prevalence per 1,000 Children",
ylim = c(0, 30), # Set value range of y axis
main="[addm] Prevalence Estimates Over Time",
sub = "zhan.gu@nus.edu.sg",
col.main="blue", col.lab="black", col.sub="darkgrey")
# Prevalence over Year, from data source:
# medi-Medicaid
plot(ASD_National$Year[ASD_National$Source == 'medi'],
ASD_National$Prevalence[ASD_National$Source == 'medi'],
type="b", lty=1, lwd=3, col="orange",
xlab="Year",
ylab="Prevalence per 1,000 Children",
ylim = c(0, 30), # Set value range of y axis
main="[medi] Prevalence Estimates Over Time",
sub = "zhan.gu@nus.edu.sg",
col.main="blue", col.lab="black", col.sub="darkgrey")
# Prevalence over Year, from data source:
# nsch-National Survey of Children's Health
plot(ASD_National$Year[ASD_National$Source == 'nsch'],
ASD_National$Prevalence[ASD_National$Source == 'nsch'],
type="l", lty=2, lwd=3, col="darkred",
xlab="Year",
ylab="Prevalence per 1,000 Children",
ylim = c(0, 30), # Set value range of y axis
main="[nsch] Prevalence Estimates Over Time",
sub = "zhan.gu@nus.edu.sg",
col.main="blue", col.lab="black", col.sub="darkgrey")
# Prevalence over Year, from data source:
# sped-Special Education Child Count
plot(ASD_National$Year[ASD_National$Source == 'sped'],
ASD_National$Prevalence[ASD_National$Source == 'sped'],
type="l", lty=3, lwd=3, col="skyblue",
xlab="Year",
ylab="Prevalence per 1,000 Children",
ylim = c(0, 30), # Set value range of y axis
main="[sped] Prevalence Estimates Over Time",
sub = "zhan.gu@nus.edu.sg",
col.main="blue", col.lab="black", col.sub="darkgrey")
par(mfrow=c(1, 1)) # Reset to one plot on one page
<h3>
Data Visualisation (Base Graphic) - <span style="color:blue">[ R ] REPORTED PREVALENCE HAS CHANGED OVER TIME</span> by [ Data Source ]
</h3>
Create multiple lines within a single chart
# ----------------------------------
# [National] < Prevalence Varies over Time/Year by Data Source >
# ----------------------------------
# Create a first line
plot(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Prevalence[ASD_National$Source == 'addm'],
col = "darkblue", lty = 1, lwd = 2,
type = "b", # plot type: "b" draws both points and lines
pch = 0, # dot/point type: http://www.endmemo.com/program/R/pchsymbols.php
xlab="Year",
xlim=c(2000, 2016), # Set x axis value range
ylab="Prevalence per 1,000 Children",
ylim=c(0, 30), # Set y axis value range
main="Prevalence Estimates Over Time by Data Source",
col.main="black", col.lab="black", col.sub="grey",
frame = FALSE, # Remove frame
axes=FALSE # Remove x and y axis
)
axis(1, at=seq(2000, 2016, 1)) # Customize x axis
axis(2, at=seq(0, 30, 5)) # Customize y axis
# Add another line
lines(ASD_National$Year[ASD_National$Source == 'medi'],
ASD_National$Prevalence[ASD_National$Source == 'medi'],
pch = 1, col = "orange", type = "b", lty = 1, lwd = 2
)
# Add another line
lines(ASD_National$Year[ASD_National$Source == 'nsch'],
ASD_National$Prevalence[ASD_National$Source == 'nsch'],
pch = 2, col = "darkred", type = "b", lty = 1, lwd = 2
)
# Add another line
lines(ASD_National$Year[ASD_National$Source == 'sped'],
ASD_National$Prevalence[ASD_National$Source == 'sped'],
pch = 5, col = "skyblue", type = "b", lty = 1, lwd = 2
)
# Add a legend to the plot
legend("topleft", legend=levels(ASD_National$Source),
col=c("darkblue", "orange", "darkred", "skyblue"),
pch = c(0, 1, 2, 5), # point symbols matching the four lines above
lty = 1, # line type
lwd = 2, # line width
cex=0.8, # size of text
bty = 'n' # Without frame
)
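The repeated `lines()` calls above follow one pattern, so they can be written as a loop over the factor levels. The sketch below uses a toy subset with only the ADDM and NSCH rows; the order of the last two NSCH values (21.2, 29.2) is an assumption, and the names `df`, `cols` and `pchs` are made up for illustration:

```r
# Toy subset: ADDM and NSCH values taken from the outputs above
# (assumption: 21.2 belongs to 2012 and 29.2 to 2016)
df <- data.frame(
  Source     = factor(c(rep("addm", 8), rep("nsch", 4))),
  Year       = c(seq(2000, 2014, by = 2), 2004, 2008, 2012, 2016),
  Prevalence = c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8,
                 9.5, 16.2, 21.2, 29.2)
)

# One colour and one point symbol per factor level, looked up by name
src  <- levels(df$Source)
cols <- setNames(c("darkblue", "darkred"), src)
pchs <- setNames(c(0, 2), src)

if (!interactive()) pdf(NULL)  # avoid creating Rplots.pdf when run as a script

# Empty canvas first, then one lines() call per source
plot(NULL, xlim = range(df$Year), ylim = c(0, 30),
     xlab = "Year", ylab = "Prevalence per 1,000 Children")
for (s in src) {
  rows <- df$Source == s
  lines(df$Year[rows], df$Prevalence[rows],
        type = "b", col = cols[[s]], pch = pchs[[s]], lwd = 2)
}
legend("topleft", legend = src, col = cols, pch = pchs, lty = 1, lwd = 2, bty = "n")
```

Looking up colours and symbols by level name keeps the legend and the lines in sync when a source is added or removed.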
R pch: dot/point type: http://www.endmemo.com/program/R/pchsymbols.php
R plot colour list: https://www.r-graph-gallery.com/42-colors-names.html
<h3>
Data Visualisation (Base Graphic) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] over [ Year ]
</h3>
# ----------------------------------
# [addm] < Prevalence Varies by Sex >
# ----------------------------------
# Create a first line
plot(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Prevalence[ASD_National$Source == 'addm'],
col = "grey", lty = 1, lwd = 2,
type = "l", # plot type: "l" draws lines only (no points)
pch = 0, # dot/point type: http://www.endmemo.com/program/R/pchsymbols.php
xlab="Year",
xlim=c(2000, 2016), # Set x axis value range
ylab="Prevalence per 1,000 Children",
ylim=c(0, 30), # Set y axis value range
main="Prevalence Estimates by Sex [ADDM]",
col.main="black", col.lab="black", col.sub="grey",
frame = FALSE, # Remove frame
axes=FALSE # Remove x and y axis
)
axis(1, at=seq(2000, 2016, 1)) # Customize x axis
axis(2, at=seq(0, 30, 5)) # Customize y axis
# Add Female prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Female.Prevalence[ASD_National$Source == 'addm'],
pch = 1, col = "orange", type = "l", lty = 1, lwd = 2)
# Add Female prevalence lower CI
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Female.Lower.CI[ASD_National$Source == 'addm'],
pch = 1, col = "orange", type = "l", lty = 3, lwd = 1)
# Add Female prevalence upper CI
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Female.Upper.CI[ASD_National$Source == 'addm'],
pch = 1, col = "orange", type = "l", lty = 3, lwd = 1)
# Add Male prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Male.Prevalence[ASD_National$Source == 'addm'],
pch = 1, col = "blue", type = "l", lty = 1, lwd = 2)
# Add Male prevalence lower CI
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Male.Lower.CI[ASD_National$Source == 'addm'],
pch = 1, col = "blue", type = "l", lty = 3, lwd = 1)
# Add Male prevalence upper CI
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Male.Upper.CI[ASD_National$Source == 'addm'],
pch = 1, col = "blue", type = "l", lty = 3, lwd = 1)
# Add a legend to the plot
legend("topleft", legend=c('ADDM Average', 'Female with 95% CI', 'Male with 95% CI'),
col=c("grey", "orange", "blue"),
# pch = 20, # dot in a line
lty = 1, # line type
lwd = 2, # line width
cex=0.8, # size of text
bty = 'n' # Without frame
)
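The female and male series above repeat the same three lines() calls (estimate, lower CI, upper CI). A minimal sketch of how that pattern could be wrapped in a function follows. The helper name add_series_with_ci and the demo data frame are assumptions for illustration, not part of the original notebook. The demo values are the six ADDM rows printed by head(ASD_National) earlier, so the overall Prevalence, Lower.CI and Upper.CI columns stand in for the sex-specific ones.

```r
# Demo data: the six ADDM rows printed by head(ASD_National) earlier in this notebook
demo <- data.frame(
  Year       = c(2000, 2002, 2004, 2006, 2008, 2010),
  Prevalence = c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7),
  Lower.CI   = c(6.3, 6.3, 7.6, 8.6, 11.0, 14.3),
  Upper.CI   = c(7.0, 6.8, 8.4, 9.3, 11.7, 15.1)
)

# Hypothetical helper (not in the original notebook):
# draw a solid estimate line plus dotted lower/upper CI lines on the current plot
add_series_with_ci <- function(df, est, lower, upper, col) {
  lines(df$Year, df[[est]],   col = col, lty = 1, lwd = 2)
  lines(df$Year, df[[lower]], col = col, lty = 3, lwd = 1)
  lines(df$Year, df[[upper]], col = col, lty = 3, lwd = 1)
  # Return the span of the CI bounds, which can help when choosing ylim
  invisible(range(df[[lower]], df[[upper]], na.rm = TRUE))
}

# Empty canvas with the same axis ranges as the plots above
plot(demo$Year, demo$Prevalence, type = "n",
     xlim = c(2000, 2016), ylim = c(0, 30),
     xlab = "Year", ylab = "Prevalence per 1,000 Children",
     main = "Prevalence Estimate with CI [ADDM]", frame.plot = FALSE)

ci_range <- add_series_with_ci(demo, "Prevalence", "Lower.CI", "Upper.CI", "grey40")
```

With the full dataset loaded, the same helper could be called once per group, for example add_series_with_ci(ASD_National[ASD_National$Source == 'addm', ], "Female.Prevalence", "Female.Lower.CI", "Female.Upper.CI", "orange").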
<h3>
Data Visualisation (Base Graphic) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY RACE AND ETHNICITY</span> [ Source: ADDM ]
</h3>
# ----------------------------------
# [addm] < Prevalence Varies by Race and Ethnicity >
# ----------------------------------
# Create a first line
plot(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Prevalence[ASD_National$Source == 'addm'],
col = "grey", lty = 1, lwd = 2,
type = "l", # draw a line (no point markers)
pch = 0, # dot/point type (ignored when type = "l"): http://www.endmemo.com/program/R/pchsymbols.php
xlab="Year",
xlim=c(2000, 2016), # Set x axis value range
ylab="Prevalence per 1,000 Children",
ylim=c(0, 30), # Set y axis value range
main="Prevalence Estimates by Race/Ethnicity [ADDM]",
col.main="black", col.lab="black", col.sub="grey",
frame = FALSE, # Remove frame
axes=FALSE # Remove x and y axis
)
axis(1, at=seq(2000, 2016, 1)) # Customize x axis
axis(2, at=seq(0, 30, 5)) # Customize y axis
# R plot colour list: https://www.r-graph-gallery.com/42-colors-names.html
# Add Asian.or.Pacific.Islander.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Asian.or.Pacific.Islander.Prevalence[ASD_National$Source == 'addm'],
pch = 20, col = "darkred", type = "b", lty = 1, lwd = 2)
# Add Hispanic.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Hispanic.Prevalence[ASD_National$Source == 'addm'],
pch = 20, col = "darkorchid3", type = "b", lty = 1, lwd = 2)
# Add Non.hispanic.black.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Non.hispanic.black.Prevalence[ASD_National$Source == 'addm'],
pch = 20, col = "deepskyblue3", type = "b", lty = 1, lwd = 2)
# Add Non.hispanic.white.Prevalence
lines(ASD_National$Year[ASD_National$Source == 'addm'],
ASD_National$Non.hispanic.white.Prevalence[ASD_National$Source == 'addm'],
pch = 20, col = "chartreuse3", type = "b", lty = 1, lwd = 2)
# Add a legend to the plot
legend("topleft", legend=c('ADDM Average',
'Non-Hispanic White',
'Non-Hispanic Black',
'Hispanic',
'Asian/Pacific Islander'),
col=c("grey", "chartreuse3", "deepskyblue3", "darkorchid3", "darkred"),
pch = 20, # dot in a line
lty = 1, # line type
lwd = 2, # line width
cex=0.8, # size of text
bty = 'n' # Without frame
)
# Set in-line plot size (width x height, in inches) for subsequent plots in Jupyter (IRkernel)
options(repr.plot.width=8, repr.plot.height=4)
<h3>
Quiz:
</h3>
<p>
Add 95% Confidence Interval to above plot
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
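One possible approach to the quiz above, as a sketch only: draw the interval as a shaded band with polygon() and put the estimate line on top. The race-specific CI column names are not shown in this section, so the demo uses the overall Prevalence, Lower.CI and Upper.CI columns from the six ADDM rows printed by head(ASD_National) earlier. The object names demo, band_x and band_y are assumptions for illustration.

```r
# Demo data: the six ADDM rows printed by head(ASD_National) earlier in this notebook
demo <- data.frame(
  Year       = c(2000, 2002, 2004, 2006, 2008, 2010),
  Prevalence = c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7),
  Lower.CI   = c(6.3, 6.3, 7.6, 8.6, 11.0, 14.3),
  Upper.CI   = c(7.0, 6.8, 8.4, 9.3, 11.7, 15.1)
)

# Empty canvas with the same axis ranges as the plots above
plot(demo$Year, demo$Prevalence, type = "n",
     xlim = c(2000, 2016), ylim = c(0, 30),
     xlab = "Year", ylab = "Prevalence per 1,000 Children",
     main = "Prevalence Estimate with CI band [ADDM]", frame.plot = FALSE)

# Polygon outline: lower bound left-to-right, then upper bound right-to-left
band_x <- c(demo$Year, rev(demo$Year))
band_y <- c(demo$Lower.CI, rev(demo$Upper.CI))
polygon(band_x, band_y, col = adjustcolor("grey", alpha.f = 0.4), border = NA)

# Estimate line drawn last so it sits on top of the band
lines(demo$Year, demo$Prevalence, col = "grey40", lwd = 2)
```

For a race/ethnicity series, the same steps would apply once the matching lower and upper CI columns in ASD_National are identified, for example with names(ASD_National).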
<h3>
Quiz:
</h3>
<p>
Use table() to count the number of prevalence records for each Data Source. Then use barplot() to visualize the counts.
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
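A minimal sketch of one way to approach this quiz. If ASD_National is not already loaded, the code falls back to a small stand-in data frame so that it runs on its own; the stand-in rows and the label 'source_b' are made up for illustration and are not real data.

```r
# Stand-in data, used only when ASD_National is not loaded (rows are made up)
if (!exists("ASD_National")) {
  ASD_National <- data.frame(
    Source = c("addm", "addm", "addm", "source_b", "source_b"),
    Year   = c(2000, 2002, 2004, 2002, 2004)
  )
}

# Count records per data source
source_counts <- table(ASD_National$Source)
print(source_counts)

# Visualise the counts as a bar chart
barplot(source_counts,
        col = "skyblue", border = NA,
        xlab = "Data Source", ylab = "No. of prevalence records",
        main = "Prevalence Records per Data Source")
```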
<h3>
Quiz:
</h3>
<p>
Which Data Sources are available in which years?
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
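One way to explore this question, sketched under the same assumption as before: a two-way table() of Source against Year shows how many records each source has in each year. The stand-in data frame is made up for illustration and is used only when ASD_National is not loaded.

```r
# Stand-in data, used only when ASD_National is not loaded (rows are made up)
if (!exists("ASD_National")) {
  ASD_National <- data.frame(
    Source = c("addm", "addm", "addm", "source_b", "source_b"),
    Year   = c(2000, 2002, 2004, 2002, 2004)
  )
}

# Cross-tabulate: rows are data sources, columns are years, cells are record counts
source_by_year <- table(Source = ASD_National$Source, Year = ASD_National$Year)
print(source_by_year)

# List the distinct years covered by each source
years_per_source <- tapply(ASD_National$Year, ASD_National$Source,
                           function(y) sort(unique(y)))
print(years_per_source)
```

A cell count of 0 in the cross-tabulation means that source has no record for that year.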
<h3>
Quiz:
</h3>
<p>
Which Data Source has Prevalence data broken down by sex/gender?
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
<h3>
Quiz:
</h3>
<p>
Which Data Source has Prevalence data broken down by race and ethnicity?
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
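Both breakdown questions above can be approached the same way: a source "has" a breakdown if at least one of its rows holds a non-missing value in the relevant column. The sketch below assumes that missing breakdowns are stored as NA. The helper name sources_with_data and the toy data frame are hypothetical, and the toy values are made up; only the column names Male.Prevalence and Hispanic.Prevalence come from the plotting code earlier in this notebook.

```r
# Toy data with made-up values; 'source_b' is a hypothetical label
toy <- data.frame(
  Source              = c("addm", "addm", "source_b", "source_b"),
  Male.Prevalence     = c(10.1, 11.2, NA, NA),
  Hispanic.Prevalence = c(NA, 5.5, NA, NA),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: which sources have at least one non-missing value in `column`?
sources_with_data <- function(df, column) {
  has_value <- !is.na(df[[column]])
  sort(unique(as.character(df$Source[has_value])))
}

print(sources_with_data(toy, "Male.Prevalence"))
print(sources_with_data(toy, "Hispanic.Prevalence"))
```

With the real data loaded, the equivalent calls would be sources_with_data(ASD_National, "Male.Prevalence") for sex and, for example, sources_with_data(ASD_National, "Hispanic.Prevalence") for race and ethnicity.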
Connect with the author:
This notebook was written by GU Zhan (Sam).
Sam is currently a lecturer at the Institute of Systems Science, National University of Singapore. He devotes himself to pedagogy & andragogy, and is passionate about inspiring the next generation of artificial intelligence lovers and leaders.
Copyright © 2020 GU Zhan
This notebook and its source code are released under the terms of the MIT License.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
<h3>
Interactive workshops: < Learning R inside R > using swirl() (in R/RStudio)
</h3>
https://github.com/telescopeuser/S-SB-Workshop
<h3>
Neural Network 101 using nnet()
</h3>
Use a neural network to classify three species of iris flowers based on four measurements: petal length, petal width, sepal length and sepal width.
# ----------------------------------
# Neural Network 101 using nnet()
# ----------------------------------
if(!require(nnet)){install.packages("nnet")}
## Loading required package: nnet
library("nnet")
# ?nnet
# < Case: predict three different iris flower types >
# https://en.wikipedia.org/wiki/Iris_flower_data_set
# https://archive.ics.uci.edu/ml/datasets/iris
# Data preparation: split iris data in two halves, for training & testing respectively.
ir <- rbind(iris3[,,1],iris3[,,2],iris3[,,3])
targets <- class.ind( c(rep("setosa", 50), rep("versicolor", 50), rep("virginica", 50)) )
samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))
# Model training (machine learning / data fitting)
ir1 <- nnet(ir[samp,], targets[samp,], size = 2, rang = 0.1,
decay = 5e-4, maxit = 200)
## # weights: 19
## initial value 56.322878
## iter 10 value 26.168810
## iter 20 value 17.958623
## iter 30 value 0.578242
## iter 40 value 0.523268
## iter 50 value 0.494103
## iter 60 value 0.484389
## iter 70 value 0.481326
## iter 80 value 0.479536
## iter 90 value 0.477770
## iter 100 value 0.477145
## iter 110 value 0.476933
## iter 120 value 0.476866
## iter 130 value 0.476813
## iter 140 value 0.476791
## iter 150 value 0.476789
## iter 160 value 0.476789
## final value 0.476788
## converged
# Model evaluation function
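test.cl <- function(true, pred) {
  true <- max.col(true)  # index of the true class (column holding the 1)
  cres <- max.col(pred)  # index of the predicted class (largest output)
  table(true, cres)      # confusion matrix: rows = true, columns = predicted
}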
# Model evaluation
test.cl(targets[-samp,], predict(ir1, ir[-samp,]))
## cres
## true 1 2 3
## 1 25 0 0
## 2 0 22 3
## 3 0 0 25
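The confusion matrix above can be summarised as an overall accuracy and a per-class recall. The sketch below re-enters the printed counts by hand so that it runs without refitting the model; the names cm, accuracy and recall are assumptions for illustration. Because sample() is called without set.seed(), a fresh run of the notebook will usually produce a different matrix.

```r
# Confusion matrix copied from the output printed above
# (rows = true class, columns = predicted class)
cm <- matrix(c(25,  0,  0,
                0, 22,  3,
                0,  0, 25),
             nrow = 3, byrow = TRUE,
             dimnames = list(true = c("1", "2", "3"),
                             cres = c("1", "2", "3")))

# Overall accuracy: correctly classified test samples / all test samples
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 0.96

# Per-class recall: correct predictions for a class / true members of that class
recall <- diag(cm) / rowSums(cm)
print(recall)    # 1.00 0.88 1.00
```

Here 72 of the 75 held-out flowers are classified correctly; the three errors are class 2 flowers predicted as class 3.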